Abstract

Linear sketches are powerful algorithmic tools that turn an n-dimensional input into 
a concise lower-dimensional representation via a linear transformation. Such sketches 
have seen a wide range of applications including norm estimation over data streams, 
compressed sensing, and distributed computing. In almost any realistic setting, however, 
a linear sketch faces the possibility that its inputs are correlated with previous evaluations 
of the sketch. Known techniques no longer guarantee the correctness of the output in 
the presence of such correlations. We therefore ask: Are linear sketches inherently 
non-robust to adaptively chosen inputs? We give a strong affirmative answer to this 
question. Specifically, we show that no linear sketch approximates the Euclidean norm 
of its input to within an arbitrary multiplicative approximation factor on a polynomial 
number of adaptively chosen inputs. The result remains true even if the dimension of 
the sketch is $d = n - o(n)$ and the sketch is given unbounded computation time. Our 
result is based on an algorithm with running time polynomial in $d$ that adaptively finds 
a distribution over inputs on which the sketch is incorrect with constant probability. 
Our result implies several corollaries for related problems including $\ell_p$-norm estimation 
and compressed sensing. Notably, we resolve an open problem in compressed sensing 
regarding the feasibility of $\ell_2/\ell_2$-recovery guarantees in the presence of computationally 
bounded adversaries. 
1 Introduction 



Recent years have witnessed an explosion in the amount of available data, such as that in data 
warehouses, the internet, sensor networks, and transaction logs. The need to process this data 
efficiently has led to the emergence of new fields, including compressed sensing, data stream 
algorithms and distributed functional monitoring. A unifying technique in these fields is the 
use of linear sketches. This technique involves specifying a distribution $\pi$ over linear maps 
$A : \mathbb{R}^n \to \mathbb{R}^r$ for a value $r \ll n$. A matrix $A$ is sampled from $\pi$. Then, in the online phase, a 
vector $x \in \mathbb{R}^n$ is presented to the algorithm, which maintains the "sketch" $Ax$. This provides 
a concise summary of $x$, from which various queries about $x$ can be approximately answered. 
The storage and number of linear measurements (rows of $A$) required is proportional to $r$. 
The goal is to minimize $r$ while still approximating a large class of queries well with high probability. 

Applications of Linear Sketches. In compressed sensing the goal is to design a distribution 
$\pi$ so that for $A \sim \pi$, given a vector $x \in \mathbb{R}^n$, from $Ax$ one can output a vector $x'$ for which 
$\|x - x'\|_p \le C\|x_{\mathrm{tail}(k)}\|_q$, where $x_{\mathrm{tail}(k)}$ denotes $x$ with its top $k$ coefficients (in magnitude) 
replaced with zero, $p$ and $q$ are norms, and $C > 1$ is an approximation parameter. The scheme 
is considered efficient if $r \le k \cdot \mathrm{poly}(\log n)$. There are two common models, the "for all" and 
"for each" models. In the "for all" model, a single $A$ is chosen and is required, with high 
probability, to work simultaneously for all $x \in \mathbb{R}^n$. In the "for each" model, the chosen $A$ is 
just required to work with high probability for any fixed $x \in \mathbb{R}^n$. 

A related model is the turnstile model for data streams. Here an underlying vector $x \in \mathbb{R}^n$ 
is initialized to $0^n$ and undergoes a long sequence of additive updates to its coordinates of 
the form $x_i \leftarrow x_i + \delta$. The algorithm is presented the updates one by one, and maintains a 
summary of what it has seen. If the summary is a linear sketch $Ax$, then given an additive 
update to the $i$-th coordinate, the summary can be updated by adding $\delta A_i$ to $Ax$, where $A_i$ is 
the $i$-th column of $A$. The best known algorithms for any problem in this model maintain 
a linear sketch. Starting with the work of Alon, Matias, and Szegedy [AMS99], problems 
such as approximating the $p$-norm $\|x\|_p = (\sum_{i=1}^n |x_i|^p)^{1/p}$ for $1 \le p \le \infty$ (also known as the 
frequency moments), the heavy hitters or largest coordinates in $x$, and many others have been 
considered; we refer the reader to [Ind07, Mut05]. Often it is required that the algorithm be 
able to query the sketch to approximate the statistic at intermediate points in the stream, 
rather than solely at the end of the stream. 
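
The turnstile update rule can be illustrated with a minimal sketch. The Gaussian choice of the matrix, the dimensions, and the helper name `update` are assumptions made for this illustration only; the point is that the summary stays equal to $Ax$ when each update adds $\delta A_i$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 1000, 20
A = rng.standard_normal((r, n))   # illustrative sketching matrix, sampled once

x = np.zeros(n)        # underlying vector, kept here only to verify the sketch
sketch = np.zeros(r)   # maintained summary, intended to equal A @ x at all times

def update(i, delta):
    """Apply x_i <- x_i + delta and add delta times the i-th column of A to the sketch."""
    x[i] += delta
    sketch[:] += delta * A[:, i]

# A short stream of additive updates (coordinate, increment).
for i, delta in [(3, 2.0), (17, -1.5), (3, 0.5), (999, 4.0)]:
    update(i, delta)

# Largest deviation between the maintained sketch and a fresh computation of A @ x.
max_err = float(np.max(np.abs(sketch - A @ x)))
```

Each update touches only one column of the matrix, so the cost per update is proportional to the number of rows rather than to the dimension of the stream vector.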

Other examples include distributed computing [MFHH02] and functional monitoring 
[CMY11]. Here there are $k$ parties $P^1,\ldots,P^k$, e.g., database servers or sensor networks, each 
with a local stream $S^i$ of additive updates to a vector $x^i$. The goal is to approximate statistics, 
such as those mentioned above, on the aggregate vector $x = \sum_{i=1}^k x^i$. If the parties share 
public randomness, they can agree upon a sketching matrix $A$. Then, each party can locally 
compute $Ax^i$, from which $Ax$ can be computed using the linearity of the sketch, namely, 
$Ax = A(x^1 + \cdots + x^k)$. The important measure is the communication complexity, which, since 
it suffices to exchange the sketches $Ax^i$, is proportional to $r$ rather than to $n$. 

Adaptively Chosen Inputs. One weakness with the models above is that they assume the 
sketching matrix A is independent of the input vector x. As pointed out in recent papers 
[GHR+12, GHS+12], there are applications for which this is inadequate. Indeed, this occurs 
in situations for which the result of performing a query on Ax influences future updates to 
the vector x. One example given in [GHR+12] is that of a grocery store, in which x consists of 
transactions, and one uses Ax to approximate the best selling items. The store may update 
its inventory based on Ax, which in turn influences future sales. A more adversarial example 
given in [GHR+12, GHS+12] is that of using a compressed sensing radar on a ship to avoid 
a missile from an attacker. Based on Ax, the ship takes evasive action, which is viewed 
by the attacker, and may change the attack. The matrix A used by the radar cannot be 
changed between successive attacks for efficiency reasons. Another example arises in high 
frequency stock trading. Imagine Alice monitors a stream of orders on the stock market 
and places her own orders depending on statistics based on sketches. A competitor Charlie 
might have a commercial interest in leading Alice's algorithm astray by observing her orders 
and manipulating the input stream accordingly. The question of sketching in adversarial 
environments was also introduced and motivated in the beautiful work of Mironov, Naor and 
Segev [MNS08] who provide several examples arising in multiparty sketching applications. 
Even from a less adversarial point of view, it seems hard to argue that in realistic settings 
there will be no correlation between the inputs to a linear sketch and previous evaluations of 
it. Resilience to such correlations would be a desirable robustness guarantee of a sketching 
algorithm. 

A deterministic sketching matrix, e.g., in compressed sensing one that satisfies the "for 
all" property above, would suffice to handle this kind of feedback. Unfortunately, such 
sketches provably have much weaker error guarantees. Indeed, if one wants the number $r$ 
of measurements to be on the order of $k \cdot \mathrm{poly}(\log n)$, then the best one can hope for is that 
for all $x \in \mathbb{R}^n$, from $Ax$ one can output $x'$ for which $\|x - x'\|_2 \le \frac{\varepsilon}{\sqrt{k}}\|x_{\mathrm{tail}(k)}\|_1$, which is known 
as the $\ell_2/\ell_1$ error guarantee. However, if one allows the "for each" property, then there are 
distributions $\pi$ over sketching matrices $A$ for which for any fixed $x \in \mathbb{R}^n$, from $Ax$ one can 
output $x'$ for which $\|x - x'\|_2 \le (1 + \varepsilon)\|x_{\mathrm{tail}(k)}\|_2$ with high probability (over $A \sim \pi$), which 
is known as the $\ell_2/\ell_2$ error guarantee. One can verify that the second guarantee is much 
stronger than the first; indeed, for constant $\varepsilon$ and $k = 1$, if $x = (\sqrt{n}, \pm 1, \pm 1, \ldots, \pm 1)$, then with 
the $\ell_2/\ell_1$ guarantee, an output of $x' = 0^n$ is valid, while for the $\ell_2/\ell_2$ guarantee, $x'_1$ must either 
be large or many coordinates of $x'$ must agree in sign with those of $x$. 
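
The comparison of the two guarantees on the example above can be checked numerically. The following is a small illustration, assuming the $\ell_2/\ell_1$ bound has the form $(\varepsilon/\sqrt{k})\|x_{\mathrm{tail}(k)}\|_1$; the values of `n` and `eps` and the helper `tail` are choices made for this sketch only.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, eps = 10_000, 1, 0.1

# x = (sqrt(n), +-1, ..., +-1) with random signs on the last n - 1 coordinates.
signs = rng.choice([-1.0, 1.0], size=n - 1)
x = np.concatenate(([np.sqrt(n)], signs))

def tail(v, k):
    """Copy of v with its k largest-magnitude coordinates replaced by zero."""
    out = v.copy()
    out[np.argsort(np.abs(v))[-k:]] = 0.0
    return out

x_out = np.zeros(n)                      # candidate output x' = 0^n
err = np.linalg.norm(x - x_out, 2)       # equals sqrt(2n - 1)

l2_l1_bound = eps / np.sqrt(k) * np.linalg.norm(tail(x, k), 1)   # eps * (n - 1)
l2_l2_bound = (1 + eps) * np.linalg.norm(tail(x, k), 2)          # (1 + eps) * sqrt(n - 1)

satisfies_l2_l1 = bool(err <= l2_l1_bound)
satisfies_l2_l2 = bool(err <= l2_l2_bound)
```

With these parameters the all-zero output meets the first bound but not the second, since $\sqrt{2n-1}$ exceeds $(1+\varepsilon)\sqrt{n-1}$ whenever $\varepsilon < \sqrt{2} - 1$.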

An important open question, indeed, the first open question in the "Open Questions 
from the Workshop on Algorithms for Data Streams 2012 at Dortmund"$^1$, is whether or not it 
is possible to achieve the $\ell_2/\ell_2$ guarantee for probabilistic polynomial time adversaries with 
limited information about $A$. The weakest possible information an adversary can have about 
$A$ is through black box queries. Formally, given a sketch $Ax$, there is a function $f(Ax)$ 
whose output satisfies a given approximation guarantee with high probability, e.g., in the 
case of compressed sensing, the guarantee would be that $f(Ax)$ satisfies the $\ell_2/\ell_2$ guarantee 
above, while in the case of data streams, the guarantee may be that $f(Ax) = (1 \pm \varepsilon)\|x\|_p$. The 
adversary only sees values $f(Ax^1), f(Ax^2), \ldots, f(Ax^t)$ for a sequence of vectors $x^1, \ldots, x^t$ of her 
choice, where $x^i$ may depend on $x^1, \ldots, x^{i-1}$ and $f(Ax^1), \ldots, f(Ax^{i-1})$. The goal of the adversary 
is to find a vector $x$ for which $f(Ax)$ does not satisfy the approximation guarantee. This 
corresponds to the private model of compressed sensing, given in Definition 3 of [GHS+12]. 

$^1$See http://ls2-www.cs.tu-dortmund.de/streamingWS2012/slides/open.problems_dortmund2012.pdf
1.1 Our Results 



We resolve the above open question in the negative. In fact, we prove a much more general 
result about linear sketches. All of our results are derived from the following promise 
problem GapNorm(B): for an input vector $x \in \mathbb{R}^n$, output 0 if $\|x\|_2 \le 1$ and output 1 if 
$\|x\|_2 \ge B$, where $B \ge 1$ is a parameter. If $x$ satisfies neither of these two conditions, the output 
of the algorithm is allowed to be 0 or 1. 
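
As a small illustration of the promise problem, the following sketch lists which outputs are acceptable on a given input. The function names are hypothetical and chosen for this example; non-strict inequalities are assumed at both thresholds.

```python
import numpy as np

def gapnorm_valid_outputs(x, B):
    """Return the set of outputs that GapNorm(B) accepts on input x."""
    norm = float(np.linalg.norm(x, 2))
    if norm <= 1.0:
        return {0}
    if norm >= B:
        return {1}
    return {0, 1}   # x violates the promise, so either answer is allowed

def is_correct(answer, x, B):
    """True if `answer` is an acceptable GapNorm(B) output on x."""
    return answer in gapnorm_valid_outputs(x, B)

B = 4.0
small = np.array([0.5, 0.5, 0.5])    # norm below 1
large = np.array([3.0, 4.0, 0.0])    # norm 5, at least B
middle = np.array([2.0, 0.0, 0.0])   # norm 2, strictly inside the gap
```

A sketch fails on a query only when its answer falls outside the returned set, so inputs inside the gap can never witness an error.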

Our main theorem is stated informally as follows. 

Theorem 1.1 (Informal version of Theorem 5.13). There is a randomized algorithm which, given 
a parameter $B \ge 2$ and oracle access to a linear sketch that uses at most $r = n - O(\log(nB))$ rows, 
with high probability finds a distribution over queries on which the linear sketch fails to solve 
GapNorm(B) with constant probability. 

The algorithm makes at most poly(rB) adaptively chosen queries to the oracle and runs in time 
poly(rB). Moreover, the algorithm uses only r "rounds of adaptivity" in that the query sequence 
can be partitioned into at most r sequences of non-adaptive queries. 

Note that the algorithm in our theorem succeeds on every linear sketch with high 
probability. In particular, our theorem implies that one cannot design a distribution over sketching 
matrices with at most $r$ rows so as to output a value in the range $[\|x\|_2, B\|x\|_2]$, that is, a 
$B$-approximation to $\|x\|_2$, and be correct with constant probability on an adaptively chosen 
sequence of poly(rB) queries. This is unless the number $r$ of rows in the sketch is $n - O(\log(nB))$, 
which agrees with the trivial $r = n$ upper bound up to a low order term. Here $B$ can be 
any arbitrary approximation factor that is only required to be polynomially bounded in 
$n$ (as otherwise the running time would not be polynomial). An interesting aspect of our 
algorithm is that it makes arguably very natural queries as they are all drawn from Gaussian 
distributions with varying covariance structure. 

We also note that the second part of our theorem implies that the queries can be grouped 
into fewer than r rounds, where in each round the queries made are independent of each 
other conditioned on previous rounds. This is close to optimal, as if $o(r/\log r)$ rounds were 
used, the sketching algorithm could partition the rows of $A$ into $o(r/\log r)$ disjoint blocks 
of $\omega(\log r)$ coordinates, and use the $i$-th block alone to respond to queries in the $i$-th round. 
If the rows of A were i.i.d. normal random variables, one can show that this would require 
a super-polynomial (in r) number of non-adaptive queries to break, even for constant B. 
Moreover, our theorem gives an algorithm with time complexity polynomial in r and B, 
and therefore rules out the possibility of using cryptographic techniques secure against 
polynomial time algorithms. 

We state our results in terms of algorithms that output any computationally unbounded 
but deterministic function $f$ of the sketch $Ax$. However, it is not difficult to extend all of our 
results to the setting where the algorithm can use additional internal randomness at each 
step to output a randomized function $f$ of $Ax$. This is discussed in Section 7.1. 

Applications. We next discuss several implications of our main theorem. Our algorithm 
in fact uses only query vectors $x$ which are $O(r)$-dimensional for $B \le \exp(r)$. Recall that for 
such vectors, $\Omega(r^{-1/2}\|x\|_2) \le \|x\|_p \le O(r^{1/2}\|x\|_2)$ for all $1 \le p \le \infty$. This gives us the following 
corollary for any $\ell_p$-norm. 
Corollary 1.2 (Informal). No linear sketch with $n - \omega(\log n)$ rows approximates the $\ell_p$-norm to 
within a fixed polynomial factor on a sequence of polynomially many adaptively chosen queries. 

The corollary also applies to other problems that are at least as hard as $\ell_p$-norm estimation, 
such as the earthmover distance, or that can be embedded into $\ell_p$ with small distortion. 

Via a reduction to GapNorm(B), we are able to resolve the aforementioned open question 
for sparse recovery even when $k = 1$. 

Corollary 1.3 (Informal). Let $C \ge 1$. No linear sketch with $o(n/C^2)$ rows guarantees $\ell_2/\ell_2$-recovery 
on a polynomial number of adaptively chosen inputs. More precisely, we can find with 
probability 2/3 an input $x$ for which the output $x'$ of the sketch does not satisfy 
$\|x - x'\|_2 \le C\|x_{\mathrm{tail}(1)}\|_2$. 

For constant approximation factors C, this shows one cannot do asymptotically better 
than storing the entire input. For larger approximation factors C, the dependence of the 
number of rows on C in this corollary is essentially best possible (at least for small k), as we 
point out in Section 7.3. 

Connection to Differential Privacy. How might one design algorithms that are robust 
to adversarial inputs? An intriguing approach is offered by the notion of differential 
privacy [DMNS06]. Indeed, differential privacy is designed to guard a private database 
$D \in \{0,1\}^n$ (here thought of as $n$ private bits) against adversarial and possibly adaptive 
queries from a data analyst. Intuitively speaking, differential privacy prevents an attacker 
from reconstructing the private bit string. In our setting we can think of $D$ as the random 
string that encodes the matrix used by the sketching algorithm and indeed our algorithm is 
precisely a reconstruction attack in the terminology of differential privacy. It is known that if $D$ 
is chosen uniformly at random, then after conditioning $D$ on the output of an $\varepsilon$-differentially 
private algorithm, the string $D$ is a strongly $2\varepsilon$-unpredictable$^2$ random string [MMP+10]. 
Hence, if the answers given by the sketching algorithm satisfy differential privacy, then the 
attacker cannot learn the randomness used by the sketching algorithm. This could then be 
used to argue that the sketch continues to be correct. 

An interesting corollary of our work is that it rules out the possibility of correctly 
answering a polynomial number of "GapNorm queries" using the differential privacy approach 
outlined above. This stands in sharp contrast to work in differential privacy which shows that 
a nearly exponential number of adaptive and adversarial "counting queries" can be answered 
while satisfying differential privacy [RR10, HR10]. A similar (though quantitatively sharper) 
separation was recently shown for the stateless mechanisms [DNV12] answering counting 
queries. While linear sketches are stateless, the model we use here in principle permits more 
flexibility in how the randomness is used by the algorithm so that the previous separation 
does not apply. 

1.2 Comparison to Previous Work 

While the above papers [GHR+12, GHS+12] introduce the problem, the results they obtain do 
not directly address the general problem. The main result of [GHS+12] is that in the private 
model of compressed sensing, the $\ell_2/\ell_2$ error guarantee is achievable with $r = k \cdot \mathrm{poly}(\log n)$ 
measurements, under the assumption that the algorithm has access to the exact value of $\|x\|_2^2$ 
as well as specific Fourier coefficients of $x$ (or approximate values to these quantities that 
come from a distribution that depends only on the exact values). While in some applications 
this may be possible, it is not hard to show that this assumption cannot be realized by any 
linear sketch unless $r \ge n$ (nor by any low space streaming algorithm or low communication 
protocol). The main result in [GHR+12] relevant to this problem is that if the adversary 
can read the sketching matrix $A$, but is required to stream through the entries in a single 
pass using logarithmic space, then it cannot generate a query $x$ for which the output of the 
algorithm on $Ax$ does not satisfy the $\ell_2/\ell_2$ error guarantee. This is quite different from the 
problem considered here, since we consider multiple adaptively chosen queries rather than a 
single query, and we do not allow direct access to $A$ but rather only observe $A$ through the 
outputs $f(Ax^i)$. 

$^2$This means that each bit of $D$ is at most $2\varepsilon$-biased conditioned on the remaining bits. 

We note that other work has observed the danger of using the output f(Ax) to create 
an input x' for which the value f(Ax') is used [AGM12a, AGM12b]. Their solution to this 
problem is just to use a new sketching matrix $A'$ drawn from $\pi$, and instead query $f(A'x')$. As 
mentioned, it may not be possible to do this, e.g., if x' is a perturbation to x, one would need 
to compute A'x' without knowing x' (since x may only be known through the sketch Ax). 
Other work [IPW11, PW13] has also considered the power of adaptively choosing matrices 
to achieve fewer measurements in compressed sensing; this is orthogonal to our work since 
we consider adaptively chosen inputs rather than adaptively chosen sketches. 

Sketching in adversarial environments was also the motivation for [MNS08]. However, 
they consider an adversarial multi-party model that is different from ours. 

1.3 Our Techniques and Proof Overview 

We prove our main theorem by considering the following game between two parties, Alice 
and Bob. Alice chooses an $r \times n$ matrix $A$ from distribution $\pi$. Bob makes a sequence of 
queries $x^1, \ldots, x^s \in \mathbb{R}^n$ to Alice, who only sees $Ax^i$ on query $i$. Alice responds by telling 
Bob the value $f(Ax^i)$. We stress that $f$ is an arbitrary function here that need not be 
efficiently computable, but for now we assume that $f$ uses no randomness. This restriction 
can be removed easily as we show later. Bob's goal is to learn the row space $R(A)$ of Alice, 
namely the at most $r$-dimensional subspace of $\mathbb{R}^n$ spanned by the rows of $A$. If Bob knew 
$R(A)$, he could, with probability 1/2 query $0^n$ and with probability 1/2 query a vector in the 
kernel of $A$. Since Alice cannot distinguish the two cases, and since the norm in one case is 0 
and in the other case non-zero, she cannot provide a relative error approximation. Our main 
theorem gives an algorithm (which can be executed efficiently by Bob) that learns $r - O(1)$ 
orthonormal vectors that are almost contained in $R(A)$. While this does not give Bob a vector 
in the kernel of $A$, it effectively reduces Alice's row space to be constant dimensional thus 
forcing her to make a mistake on sufficiently many queries. 
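
The observation that a kernel vector is indistinguishable from the zero vector can be illustrated as follows. This is only a toy instance: the Gaussian matrix, the dimensions, and the scaling of the kernel vector are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
r, n = 5, 12
A = rng.standard_normal((r, n))   # illustrative sketching matrix with full row rank

# With full_matrices=True, Vt is n x n; its last n - r rows span the kernel of A.
_, _, Vt = np.linalg.svd(A, full_matrices=True)
z = 10.0 * Vt[-1]                 # a kernel vector scaled to Euclidean norm 10

sketch_of_zero = A @ np.zeros(n)
sketch_of_z = A @ z

# Alice only sees the sketches, which coincide up to floating point error.
gap = float(np.max(np.abs(sketch_of_z - sketch_of_zero)))
z_norm = float(np.linalg.norm(z))
```

Because both queries produce the same sketch, any deterministic answer is the same on both, although one input has norm 0 and the other has norm 10.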

The conditional expectation lemma. In order to learn $R(A)$, Bob's initial query is drawn 
from the multivariate normal distribution $N(0, \tau I_n)$, where $\tau I_n$ is the covariance matrix, 
which is just a scalar $\tau$ times the identity matrix $I_n$. This ensures that Alice's view of Bob's 
query $x$, namely, the projection $P_A x$ of $x$ onto $R(A)$, is spherically symmetric, and so only 
depends on $\|P_A x\|_2$. Given $\|P_A x\|_2$, Alice needs to output 0 or 1 depending on what she thinks 
the norm of $x$ is. The intuition is that since Alice has a proper subspace of $\mathbb{R}^n$, she will be 
confused into thinking $x$ has larger norm than it does when $\|P_A x\|_2$ is slightly larger than its 
expectation (for a given $\tau$), that is, when $x$ has a non-trivial correlation with $R(A)$. Formally, 
we can prove a conditional expectation lemma showing that there exists a choice of $\tau$ for 
which $\mathbb{E}_{x \sim N(0,\tau I_n)}[\|P_A x\|_2^2 \mid f(Ax) = 1] - \mathbb{E}_{x \sim N(0,\tau I_n)}[\|P_A x\|_2^2]$ is non-trivially large. This is done 
by showing that the sum of this difference over all possible $\tau$ in a range $[1, B]$ is noticeably 
positive. Here $B$ is the approximation factor that we tolerate. In particular, there exists a 
$\tau$ for which this difference is large. To show the sum is large, for each possible condition 
$v = \|P_A x\|_2^2$, there is a probability $q(v)$ that the algorithm outputs 1, and as we range over all 
$\tau$, $q(v)$ contributes both positively and negatively to the above difference based on $v$'s weight 
in the $\chi^2$-distribution with mean $r \cdot \tau$. The overall contribution of $v$ can be shown to be zero. 
Moreover, by correctness of the sketch, $q(v)$ must typically be close to 0 for small values of $v$, 
and typically close to 1 for large values of $v$. Therefore $q(v)$ zeros out some of the negative 
contributions that $v$ would otherwise make and ensures some positive contributions in total. 
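
The quantity in the conditional expectation lemma can be estimated by simulation for a toy sketch. The sketch below assumes, purely for illustration, that Alice's row space is spanned by the first `r` coordinates and that her answer is a simple threshold on the squared projection; the real lemma handles arbitrary functions and a range of values of the variance parameter.

```python
import numpy as np

rng = np.random.default_rng(3)
n, r, tau = 60, 20, 1.0
m = 20_000                                   # number of sampled queries

# Queries x ~ N(0, tau * I_n), one per row.
X = np.sqrt(tau) * rng.standard_normal((m, n))

# Squared norm of the projection onto the (assumed) row space: first r coordinates.
proj_sq = np.sum(X[:, :r] ** 2, axis=1)

# Hypothetical answer rule: output 1 when the squared projection looks large.
threshold = 1.2 * r * tau
labels = proj_sq >= threshold

unconditional = float(proj_sq.mean())        # close to r * tau
conditional = float(proj_sq[labels].mean())  # mean over positively labeled queries
advantage = conditional - unconditional
```

The positive value of `advantage` is the slight correlation with the row space that the attack later amplifies.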



Boosting a small correlation. Given the conditional expectation lemma we can find 
many independently chosen $x^i$ for which each $x^i$ has a slightly increased expected projection 
onto Alice's space $R(A)$. At this point, however, it is not clear how to proceed unless we can 
aggregate these slight correlations into a single vector which has very high correlation with 
$R(A)$. We accomplish this by arranging all $m = \mathrm{poly}(n)$ positively labeled vectors $x^i$ into an 
$m \times n$ matrix $G$ and computing the top right singular vector $v^*$ of $G$. Note that this can be 
done efficiently. We show that, indeed, $\|P_A v^*\|_2 \ge 1 - 1/\mathrm{poly}(n)$. In other words $v^*$ is almost 
entirely contained in $R(A)$. This step is crucial as it gives us a way to effectively reduce the 
dimension of Alice's space by 1 as we will see next. 
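
The boosting step can be tried on a toy instance. The sketch below assumes a coordinate row space and a threshold answer rule, both chosen for illustration, and uses `numpy.linalg.svd` to obtain the top right singular vector of the positively labeled queries.

```python
import numpy as np

rng = np.random.default_rng(4)
n, r, m = 30, 5, 40_000

# Isotropic Gaussian queries; the (assumed) row space is the first r coordinates.
X = rng.standard_normal((m, n))
proj_sq = np.sum(X[:, :r] ** 2, axis=1)

# Hypothetical answer rule: label 1 when the squared projection is at least its mean.
positive = proj_sq >= r
G = X[positive]                      # matrix of positively labeled queries

# Rows of Vt are right singular vectors, ordered by decreasing singular value.
_, _, Vt = np.linalg.svd(G, full_matrices=False)
v_star = Vt[0]

# Squared norm of the projection of v_star onto the row space.
mass_in_subspace = float(np.sum(v_star[:r] ** 2))
```

Conditioning on a positive label inflates the variance only inside the row space, so the direction of largest variance of the labeled queries concentrates there.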



Iterating the attack. After finding one vector inside Alice's space, we are unfortunately 
not done. In fact Alice might initially use only a small fraction of her rows and switch to a 
new set of rows after Bob has learned her initial rows. We thus iterate the previously described 
attack as follows. Bob now makes queries from a multivariate normal distribution inside of 
the subspace orthogonal to the previously found vector. In this way we have effectively 
reduced the dimension of Alice's space by 1, and we can repeat the attack until her space is 
of constant dimension, at which point a standard non-adaptive attack is enough to break the 
sketch. Several complications arise at this point. For example, each vector that we find is only 
approximately contained in R(A). We need to rule out that this approximation error could 
help Alice. We do so by adding a sufficient amount of global Gaussian noise to our query 
distribution. This has the effect of making the distribution statistically indistinguishable 
from a query distribution defined by vectors that are exactly contained in Alice's space. Of 
course, we then also need a generalized conditional expectation lemma for such distributions. 



Paper Outline 

We start with some technical preliminaries in Section 2. We then prove the conditional 
expectation lemma in Section 4. The proof of this lemma requires rather detailed information 
about averages of $\chi^2$-distributions in certain intervals. The development of these bounds is 
contained in Section 3. In Section 5 we present and analyze our complete adaptive attack. 
The proof again requires several technical ingredients. One tool (given in Section 4.2) relates 
a distance function between two subspaces to the statistical distance of certain distributions 
that we use in our attack. The other tool in Section 6 analyzes the top singular vector of 
certain biased Gaussian matrices arising in our attack. In Section 7 we give applications to 
compressed sensing, data streams, and distributed functional monitoring. 

2 Preliminaries 

Notation. Given a subspace $V \subseteq \mathbb{R}^n$, we denote by $P_V$ the orthogonal projection operator 
onto the space $V$. The orthogonal complement of a linear space $V$ is denoted by $V^\perp$. When $X$ 
is a distribution we use $x \sim X$ to indicate that $x$ is a random variable drawn according to the 
distribution $X$. 

Linear Sketches. A linear sketch is given by a distribution $\mathcal{M}$ over $r \times n$ matrices and an 
evaluation mapping $F : \mathbb{R}^{r \times n} \times \mathbb{R}^r \to R$ where $R$ is some output space which we typically 
choose to be $R = \{0, 1\}$. The algorithm initially samples a matrix $A \sim \mathcal{M}$. The answer to each 
query $x \in \mathbb{R}^n$ is then given by $F(A, Ax)$. Since the evaluation map $F$ is not restricted in any 
way the concrete representation of $A$ as a matrix is not important. We will therefore identify 
$A$ with its image, an $r$-dimensional subspace of $\mathbb{R}^n$ (w.l.o.g. $A$ has full row rank). In this 
case, we can write an instance of a sketch as a mapping $f : \mathbb{R}^n \to R$ satisfying the identity 
$f(x) = f(P_A x)$. In this case we may write $f : A \to \{0, 1\}$ even though $f$ is defined on all of $\mathbb{R}^n$ 
via orthogonal projection onto $A$. 
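
The identity $f(x) = f(P_A x)$ can be checked on a small instance. In the sketch below the Gaussian matrix and the evaluation map `F` (a norm threshold) are hypothetical stand-ins; the projection onto the row space is computed with the pseudoinverse.

```python
import numpy as np

rng = np.random.default_rng(5)
r, n = 4, 10
A = rng.standard_normal((r, n))   # stand-in for one draw of the sketching matrix

# Orthogonal projection onto the row space of A, computed as pinv(A) @ A.
P_A = np.linalg.pinv(A) @ A

def F(A, y):
    """Hypothetical evaluation map: threshold on the Euclidean norm of the sketch."""
    return int(np.linalg.norm(y) >= 5.0)

def f(x):
    """One instance of the sketch: the answer depends on x only through A @ x."""
    return F(A, A @ x)

x = 3.0 * rng.standard_normal(n)

# A @ x and A @ (P_A @ x) agree, so f cannot tell x from its projection.
sketch_gap = float(np.max(np.abs(A @ x - A @ (P_A @ x))))
same_answer = f(x) == f(P_A @ x)
```

The component of a query orthogonal to the row space never reaches the evaluation map, which is why the matrix can be identified with its row space.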

Distributions. We denote the $d$-dimensional Gaussian distribution with mean $\mu \in \mathbb{R}^d$ and 
independent coordinates with variance $\sigma^2 \in \mathbb{R}$ by $N(\mu, \sigma^2)^d$. The statistical distance (or total 
variation distance) between two distributions $X, Y$ is denoted by $\|X - Y\|_{tv}$. 

3 Certain Averages of $\chi^2$-distributions 

In this section we develop the main technical ingredients for our conditional expectation 
lemma. Specifically, we will work in $\mathbb{R}^d$ and consider weighted averages of the $\chi^2$-distribution 
in certain intervals. The density function of the squared Euclidean norm of a $d$-dimensional 
standard Gaussian variable is given by $\nu(s) = s^{d/2-1} e^{-s/2} / (2^{d/2}\Gamma(d/2))$. We let $\nu_{\tau,d} : [0, \infty) \to [0, 1]$ 
be the density function of a $\chi^2$-distribution with $d$ degrees of freedom and expectation $\tau$. 
Note that this coincides with the density function of the squared norm of a $d$-dimensional 
Gaussian variable $N(0, \tau/d)^d$, which we will denote as: 

$$\nu_{\tau,d}(s) = \frac{d\,(sd/\tau)^{d/2-1}\, e^{-sd/(2\tau)}}{\tau\, 2^{d/2}\, \Gamma(d/2)}. \tag{1}$$

Here we used that $\nu_{\tau,d}(s) = d\,\nu(sd/\tau)/\tau$. We will omit the subscript $d$ whenever it is clear 
from the context. Further, let $\Gamma_d$ denote the probability measure on $[0, \infty)$ of the Gamma 
distribution given by the density $\gamma_d : (0, \infty) \to [0, 1]$, 

$$\gamma_d(x) = \frac{x^{d-1} e^{-x}}{\Gamma(d)}. \tag{2}$$
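
As a sanity check on the rescaled density, one can verify numerically that it integrates to 1 and has expectation equal to its parameter. The following sketch assumes the density formula $d\,\nu(sd/\tau)/\tau$; the helper names `nu` and `trapezoid` and the chosen values of `d` and `tau` are illustrative.

```python
import math
import numpy as np

def nu(s, tau, d):
    """Density of (tau / d) times a chi-square variable with d degrees of freedom."""
    u = s * d / tau
    log_val = (math.log(d) + (d / 2 - 1) * np.log(u) - u / 2
               - math.log(tau) - (d / 2) * math.log(2) - math.lgamma(d / 2))
    return np.exp(log_val)

def trapezoid(y, x):
    """Composite trapezoid rule for samples y on the grid x."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2)

d, tau = 20, 35.0
s = np.linspace(1e-9, 20 * tau, 400_001)   # the mass beyond 20 * tau is negligible

total_mass = trapezoid(nu(s, tau, d), s)   # should be close to 1
mean = trapezoid(s * nu(s, tau, d), s)     # should be close to tau
```

Working with logarithms of the density avoids overflow in the Gamma function for larger values of `d`.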



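Lemma 3.1. Let $0 < a \le b$ and let $s > 0$. Then, 

$$\int_a^b s\,\nu_\tau(s)\,d\tau = \frac{s}{1 - 2/d}\;\Gamma_{d/2-1}\!\left(\left[\frac{sd}{2b}, \frac{sd}{2a}\right]\right), \tag{3}$$

$$\int_a^b \tau\,\nu_\tau(s)\,d\tau = \frac{s}{1 - 6/d + 8/d^2}\;\Gamma_{d/2-2}\!\left(\left[\frac{sd}{2b}, \frac{sd}{2a}\right]\right). \tag{4}$$

Proof. Applying Equation 1 and substituting $x = sd/2\tau$, we have $\frac{d\tau}{dx} = -\frac{sd}{2x^2}$. It follows 
that $\frac{1}{\tau}\,d\tau = -\frac{1}{x}\,dx$. Put $a' = sd/2b$ and $b' = sd/2a$. Thus, 

$$\int_a^b \nu_\tau(s)\,d\tau = \int_a^b \frac{d\,(sd/\tau)^{d/2-1}\, e^{-sd/(2\tau)}}{\tau\, 2^{d/2}\, \Gamma(d/2)}\,d\tau = \int_{a'}^{b'} \frac{d\,x^{d/2-2}\, e^{-x}}{2\,\Gamma(d/2)}\,dx.$$

On the other hand, $\Gamma(d/2) = (d/2 - 1)\Gamma(d/2 - 1)$. Hence, 

$$\int_a^b \nu_\tau(s)\,d\tau = \frac{d}{2(d/2-1)}\;\Gamma_{d/2-1}([a', b']) = \frac{1}{1 - 2/d}\;\Gamma_{d/2-1}([a', b']),$$

and multiplying both sides by $s$ gives Equation 3. 

The second equation is shown similarly, again substituting $x = sd/2\tau$ and noting that 
$\frac{1}{\tau}\,d\tau = -\frac{1}{x}\,dx$: 

$$\int_a^b \tau\,\nu_\tau(s)\,d\tau = \int_a^b \frac{d\,(sd/\tau)^{d/2-1}\, e^{-sd/(2\tau)}}{2^{d/2}\, \Gamma(d/2)}\,d\tau = \int_{a'}^{b'} \frac{s d^2\, x^{d/2-3}\, e^{-x}}{4\,\Gamma(d/2)}\,dx = \frac{s d^2\;\Gamma_{d/2-2}([a', b'])}{4(d/2-1)(d/2-2)}.$$

Furthermore, 

$$\frac{d^2}{4(d/2-1)(d/2-2)} = \frac{1}{4(1/4 - 1/2d - 1/d + 2/d^2)} = \frac{1}{1 - 6/d + 8/d^2}.$$

■ 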
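
The two identities of Lemma 3.1 can be checked by numerical integration. The sketch below assumes the forms of Equations 3 and 4 with a factor $s$ on the right-hand side; the parameter values and the helper names are choices made for this illustration.

```python
import math
import numpy as np

def nu(s, tau, d):
    """Density of (tau / d) times a chi-square variable with d degrees of freedom."""
    u = s * d / tau
    return np.exp(math.log(d) + (d / 2 - 1) * np.log(u) - u / 2
                  - np.log(tau) - (d / 2) * math.log(2) - math.lgamma(d / 2))

def gamma_density(x, k):
    """Density x^(k-1) e^(-x) / Gamma(k) of the Gamma distribution with shape k."""
    return np.exp((k - 1) * np.log(x) - x - math.lgamma(k))

def trapezoid(y, x):
    """Composite trapezoid rule for samples y on the grid x."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2)

d, s, a, b = 20, 25.0, 20.0, 80.0
tau = np.linspace(a, b, 200_001)
x = np.linspace(s * d / (2 * b), s * d / (2 * a), 200_001)   # the interval [a', b']

lhs3 = trapezoid(s * nu(s, tau, d), tau)
rhs3 = s / (1 - 2 / d) * trapezoid(gamma_density(x, d / 2 - 1), x)

lhs4 = trapezoid(tau * nu(s, tau, d), tau)
rhs4 = s / (1 - 6 / d + 8 / d**2) * trapezoid(gamma_density(x, d / 2 - 2), x)
```

Agreement of `lhs3` with `rhs3` and of `lhs4` with `rhs4` up to discretization error is consistent with the change of variables used in the proof.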



Let us introduce the function $\Delta : [0, \infty) \to \mathbb{R}$ defined as 

$$\Delta(s) \overset{\mathrm{def}}{=} \int_{d}^{Bd} (s - \tau)\,\nu_\tau(s)\,d\tau. \tag{5}$$

Here $B \ge 4$ is some parameter that we will choose later. Figure 1 illustrates the behavior of 
this function. The next lemma states the properties of $\Delta$ that we will need. 

Figure 1: $\Delta(s)$ plotted for $d = 20$ and $B = 4$. 

Lemma 3.2. Assume $d \ge 20$. Then, for every $s \in [0, Bd/2]$, we have that $\Delta(s) \le 0$. Moreover, for 
every $s \in [d, 2d]$, we have $\Delta(s) \le -s/3d$. 

Proof. First consider the case where $s \in [2d, Bd/2]$. By Lemma 3.1, we have 

$$\Delta(s) = \frac{s}{1 - 2/d}\;\Gamma_{d/2-1}\!\left(\left[\frac{s}{2B}, \frac{s}{2}\right]\right) - \frac{s}{1 - 6/d + 8/d^2}\;\Gamma_{d/2-2}\!\left(\left[\frac{s}{2B}, \frac{s}{2}\right]\right).$$

On the other hand, for this choice of $s$, we have 

$$\Gamma_{d/2-2}\!\left(\left[\frac{s}{2B}, \frac{s}{2}\right]\right) \ge \Gamma_{d/2-2}\!\left(\left[\frac{d}{4}, d\right]\right) \ge 1 - 1/d.$$

Here, the last step can be verified directly by using that $\Gamma_{d/2-2}$ is strongly concentrated around 
its mean $d/2 - 2$ and has variance bounded by $d$. For $d \ge 20$, the approximation we used is 
valid. Hence, 

$$\Delta(s) \le s\left(\frac{1}{1 - 2/d} - \frac{1 - 1/d}{1 - 6/d + 8/d^2}\right) = s\left(\frac{1 - 1/d}{1 - 3/d + 2/d^2} - \frac{1 - 1/d}{1 - 6/d + 8/d^2}\right) \le -\frac{s}{d}.$$

Here we used our lower bound on $d$ again. 

Now consider the case $s \in [d - 4, 2d]$. In this case we have 
$\Gamma_{d/2-2}([s/2B, s/2]) \ge \Gamma_{d/2-2}([d/B, d/2 - 2]) \ge 1/2 - 1/d$ by concentration bounds for $\Gamma_{d/2-2}$ 
and using that the median of $\Gamma_{d/2-2}$ is at most $d/2 - 2$. For the same reason, 
$\Gamma_{d/2-2}([s/2B, s/2]) \ge 1/2 - 1/d$. Moreover, we have that 
$\Gamma_{d/2-2}([s/2B, s/2]) \ge \Gamma_{d/2-1}([s/2B, s/2])$ because $\Gamma_{d/2-1}([d/2 - 2, \infty)) \ge \Gamma_{d/2-2}([d/2 - 2, \infty))$. This 
follows because $\Gamma_{d/2-1}$ has larger mean and greater variance than $\Gamma_{d/2-2}$. Hence, 

$$\Delta(s) \le s\,(1/2 - 1/d)\left(\frac{1}{1 - 2/d} - \frac{1}{1 - 6/d + 8/d^2}\right) \le -\frac{s}{3d}.$$

Finally, let $s \in [0, d - 4]$. In this case we have $[s/2B, s/2] \subseteq [0, d/2 - 2]$. But, for every 
$x \in [0, d/2 - 2]$, we have 

$$\gamma_{d/2-2}(x) = \gamma_{d/2-1}(x)\left(\frac{d/2 - 2}{x}\right) \ge \gamma_{d/2-1}(x).$$

Hence, $\Delta(s) \le 0$. ■ 
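
The behavior described by Lemma 3.2 can be explored numerically for the parameters of Figure 1. The sketch below assumes that the function is defined by integrating $(s - \tau)\nu_\tau(s)$ over $\tau \in [d, Bd]$; the grids and variable names are illustrative, and the check covers only $d = 20$, $B = 4$.

```python
import math
import numpy as np

def nu(s, tau, d):
    """Density of (tau / d) times a chi-square variable with d degrees of freedom."""
    u = s * d / tau
    return np.exp(math.log(d) + (d / 2 - 1) * np.log(u) - u / 2
                  - np.log(tau) - (d / 2) * math.log(2) - math.lgamma(d / 2))

d, B = 20, 4
tau = np.linspace(d, B * d, 1201)       # integration variable
s = np.linspace(1e-6, 30 * d, 3001)     # points at which the function is evaluated

# Trapezoid rule in tau for every s at once.
S, T = np.meshgrid(s, tau, indexing="ij")
integrand = (S - T) * nu(S, T, d)
dtau = tau[1] - tau[0]
delta = dtau * (integrand.sum(axis=1) - 0.5 * (integrand[:, 0] + integrand[:, -1]))

# Integral of the function over s, which should be close to zero.
ds = s[1] - s[0]
total = float(ds * (delta.sum() - 0.5 * (delta[0] + delta[-1])))

lower = (s >= 1.0) & (s <= B * d / 2)
middle = (s >= d) & (s <= 2 * d)
negative_on_lower = bool(np.all(delta[lower] < 0))
strong_on_middle = bool(np.all(delta[middle] < -s[middle] / (3 * d)))
```

On this grid the function is negative up to $Bd/2$, lies below $-s/3d$ between $d$ and $2d$, and integrates to approximately zero, so its positive part must sit at larger values of $s$.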



Our main lemma in this section is stated next. 

Lemma 3.3. Let $d \ge d_0$ for a sufficiently large constant $d_0$. Let $B \ge 4$. Let $h : [0, \infty) \to [0, 1]$ be any 
function satisfying the properties: 

1. $\int_{Bd/2}^{2Bd} (1 - h(s))\,ds \le 1/Bd$, 

2. $\int_0^{2d} h(s)\,ds \le 1/d$. 

Then, we have 

$$\int_{s=0}^{\infty}\int_{\tau=d}^{Bd} (s - \tau)\,\nu_\tau(s)\,h(s)\,d\tau\,ds \ge \frac{d}{4}. \tag{6}$$

Proof. First observe that 

$$\int_{s=0}^{\infty}\int_{\tau=d}^{Bd} (s - \tau)\,\nu_\tau(s)\,h(s)\,d\tau\,ds = \int_{s=0}^{\infty} h(s)\Delta(s)\,ds.$$

Moreover, since $\int_0^\infty s\,\nu_\tau(s)\,ds = \tau$, we have 

$$\int_{s=0}^{\infty} \Delta(s)\,ds = 0. \tag{7}$$

Let us consider the three intervals 

$$L = [0, 2d), \quad M = [2d, Bd/2), \quad U = [Bd/2, \infty).$$

Claim 3.4. Without loss of generality, $\int_M h(s)\Delta(s)\,ds \ge \int_M \Delta(s)\,ds$. 

Proof. Lemma 3.2 tells us that $\Delta(s) \le 0$ for all $s \in M$. Since we are interested in lower bounding 
$\int h(s)\Delta(s)\,ds$, we can therefore assume without loss of generality that $h(s) = 1$ for all $s \in M$. ■ 

Claim 3.5. $\int_U h(s)\Delta(s)\,ds \ge \int_U \Delta(s)\,ds - 6$. 

Proof. The claim follows from the first condition on $h$ which implies that $h(s) = 1$ almost 
everywhere in the interval $I = [Bd/2, 2Bd]$. In particular, 

$$\int_I h(s)\Delta(s)\,ds = \int_I \Delta(s)\,ds - \int_I (1 - h(s))\Delta(s)\,ds \ge \int_I \Delta(s)\,ds - \frac{1}{Bd}\max_{s \in I}|\Delta(s)| \ge \int_I \Delta(s)\,ds - 4.$$

Here we used that $|\Delta(s)| \le 4Bd$. Moreover, for every $\tau \in [d, Bd]$ we have that $\int_{2Bd}^{\infty} \tau\,\nu_\tau(s)\,ds \le 
1/2Bd$ by standard tail bounds for $\nu_\tau$ and sufficiently large $d$. This implies 

$$\int_U h(s)\Delta(s)\,ds \ge \int_I h(s)\Delta(s)\,ds - 1.$$

Similarly $\int_I \Delta(s)\,ds \ge \int_U \Delta(s)\,ds - 1$. The claim follows by combining these statements. ■ 

Claim 3.6. $\int_L h(s)\Delta(s)\,ds \ge \int_L \Delta(s)\,ds + d/3 - 4$. 

Proof. Here we use the second condition on $h$ which implies 

$$\int_L h(s)\Delta(s)\,ds \ge -\max_{s \in L}|\Delta(s)| \int_L h(s)\,ds \ge -4,$$

where we used that $|\Delta(s)| \le 4d$ in this range. On the other hand, by Lemma 3.2, $\Delta(s) \le 0$ for 
all $s \in L$ and for $s \in [d, 2d]$ we have $\Delta(s) \le -s/3d$. Hence, 

$$\int_L \Delta(s)\,ds \le -\frac{1}{3d}\int_d^{2d} s\,ds \le -\frac{d}{3}.$$

■ 

Combining all three claims we get 

$$\int_0^\infty h(s)\Delta(s)\,ds \ge \int_0^\infty \Delta(s)\,ds + \frac{d}{3} - 10 = \frac{d}{3} - 10,$$

where we used Equation 7 in the last step. For sufficiently large $d$, we have $d/3 - 10 \ge d/4$ 
and the lemma follows. ■ 



4 Conditional Expectation Lemma 

The key tool in our algorithm is what we call the conditional expectation lemma. Informally 
it shows that we can always find a distribution over inputs that have a non-trivially large 
correlation with the unknown subspace used by the linear sketch. Our presentation here, 
however, will not need the interpretation in terms of linear sketches. Fix a $d$-dimensional
linear subspace $U \subseteq \mathbb{R}^n$. Throughout this section, we think of $d$ as being lower bounded by
a sufficiently large constant. We will consider functions of the type $f\colon \mathbb{R}^n \to \{0,1\}$ which
satisfy the identity $f(x) = f(P_U x)$ for all $x \in \mathbb{R}^n$. To indicate that this identity holds we will
write $f\colon U \to \{0,1\}$. As explained in Section 2, we can think of these functions as instances of
a linear sketch.

Definition 4.1 (Subspace Gaussian). Let $U \subseteq \mathbb{R}^n$ be a linear subspace of $\mathbb{R}^n$. We say that a
family of distributions $\mathcal{G}(U) = \{g_\tau\}_{\tau\in(0,\infty)}$ is a subspace Gaussian family if

1. $P_U g_\tau$ is distributed like a standard Gaussian variable inside $U$ satisfying $\mathbb{E}\|P_U g_\tau\|^2 = \tau$.

2. $P_{U^\perp} g_\tau$ is a spherical Gaussian distribution that does not depend on $\tau$ and is moreover
statistically independent of $P_U g_\tau$.

Lemma 4.2. The norm $\|P_U g_\tau\|^2$ is a sufficient statistic for a subspace Gaussian family $\mathcal{G}(U) = \{g_\tau\}_\tau$.
Formally, for every $s > 0$, the distribution of $g_\tau$ is independent of $\tau$ under the condition that
$s = \|P_U g_\tau\|^2$.

Remark 4.3. The reader concerned about the condition $\|P_U g_\tau\|^2 = s$ (which has probability 0
under $g_\tau$) is referred to the excellent article of Chang and Pollard [CP97] (Example 6) where it is
shown how to formally justify this conditional distribution using the notion of a disintegration.
Specifically, here we mean that the distributions $g_\tau$ have a joint disintegration in terms of the
variable $\|P_U g_\tau\|^2$. As explained in [CP97], we can prove the above lemma by appealing to the
factorization theorem for sufficient statistics described therein. 

Proof of Lemma 4.2. Note that $g_\tau = g_1 + g_2$ where $g_2$ is some distribution independent of $\tau$
supported on $U^\perp$ and $g_1$ is supported on $U$. By spherical symmetry of both $g_1$ and $g_2$ we may
assume without loss of generality that $U$ is a coordinate subspace, say, the first $d = \dim(U)$
coordinates of the standard basis. Since $g_2$ is independent of $g_1$ and supported on a disjoint set
of coordinates, it suffices to verify the claim for $g_1$. Specifically, by the Factorization theorem
for sufficient statistics (see [CP97]), we need to show that the density of $g_1$ can be factored
into the product of two functions $f(x)$ and $h_\tau(x)$ such that $f$ does not depend on $\tau$ and $h_\tau(x)$
depends on $\tau$ but is a function of the parameter $\|x\|^2$. This follows directly from the fact that
the Gaussian density at a point $x$ depends only on $\|x\|^2$. ■

The following definition captures the condition that $f$ should evaluate to 1 on inputs that
have large norm and should evaluate to 0 on inputs that have small norm.

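Definition 4.4 (Soundness). We say that a function $f\colon U \to \{0,1\}$ is $B$-sound for a subspace
Gaussian family $\mathcal{G}(U)$, where $\dim(U) = d$, if it satisfies the requirements:

1. $\int_{Bd/2}^{2Bd} \bigl(1 - \mathbb{E}[f(g_\tau) \mid \|P_U g_\tau\|^2 = s]\bigr)\,ds \le 1/Bd$.

2. $\int_{0}^{2d} \mathbb{E}[f(g_\tau) \mid \|P_U g_\tau\|^2 = s]\,ds \le 1/d$.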

We are ready to state and prove the Conditional Expectation Lemma. 

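Lemma 4.5. Let $B \ge 4$. Let $\mathcal{G}(U)$ be a subspace Gaussian family where $U$ has sufficiently
large dimension $d \ge d_0$. Suppose $f\colon U \to \{0,1\}$ is $B$-sound for $\mathcal{G}(U)$. Then, there exists $\tau \in [d,Bd]$
such that

1. $\mathbb{E}\bigl[\|P_U g_\tau\|^2 \mid f(g_\tau) = 1\bigr] \ge \mathbb{E}\bigl[\|P_U g_\tau\|^2\bigr] + \frac{1}{4B}$,

2. $\mathbb{P}\{f(g_\tau) = 1\} \ge \frac{1}{40B^2 d}$.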

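Proof. Define the function $h\colon (0,\infty) \to \mathbb{R}$ by putting

$$h(s) = \mathbb{E}\bigl[f(g_\tau) \mid \|P_U g_\tau\|^2 = s\bigr]. \qquad (8)$$

Note that this is well-defined by Lemma 4.2. Let us first rewrite the conditional expectation
as follows.

$$\begin{aligned}
\mathbb{E}\bigl[\|P_U g_\tau\|^2 \mid f(g_\tau) = 1\bigr]
&= \int_0^\infty s\,\mathbb{P}\{\|P_U g_\tau\|^2 = s \mid f(g_\tau) = 1\}\,ds \\
&= \int_0^\infty s\,\mathbb{P}\{f(g_\tau) = 1 \mid \|P_U g_\tau\|^2 = s\}\cdot\frac{v_\tau(s)}{\mathbb{P}\{f(g_\tau) = 1\}}\,ds \quad \text{(by Bayes' rule)} \\
&= \int_0^\infty \frac{s\,h(s)\,v_\tau(s)}{\mathbb{P}\{f(g_\tau) = 1\}}\,ds.
\end{aligned}$$

Note that here and in the following $v_\tau = v_{\tau,d}$, i.e., the $\chi^2$-distribution has $d$ degrees of
freedom corresponding to the dimension of $U$.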

Claim 4.6. The lemma follows from the following inequality:

$$\int_d^{Bd}\int_0^\infty (s - \tau)\,v_\tau(s)\,h(s)\,ds\,d\tau \ge \frac{d}{4}. \qquad (9)$$

Proof. Indeed, assuming the above inequality it follows that there must be a $\tau \in [d,Bd]$ such
that

$$\int_0^\infty s\,v_\tau(s)\,h(s)\,ds \ge \tau\int_0^\infty v_\tau(s)\,h(s)\,ds + \frac{d}{4Bd} = \tau\,\mathbb{P}\{f(g_\tau) = 1\} + \frac{1}{4B}.$$

In particular,

$$\mathbb{E}\bigl[\|P_U g_\tau\|^2 \mid f(g_\tau) = 1\bigr] = \int_0^\infty \frac{s\,h(s)\,v_\tau(s)}{\mathbb{P}\{f(g_\tau) = 1\}}\,ds \ge \frac{\tau\,\mathbb{P}\{f(g_\tau) = 1\} + 1/4B}{\mathbb{P}\{f(g_\tau) = 1\}} = \tau + \frac{1/4B}{\mathbb{P}\{f(g_\tau) = 1\}}.$$

Since $\mathbb{P}\{f(g_\tau) = 1\} \le 1$, this gives us the first conclusion of the lemma. It remains to lower
bound $\mathbb{P}\{f(g_\tau) = 1\}$. Here we use that $\int_{5Bd}^\infty s\,v_\tau(s)h(s)\,ds \le \frac{1}{2}\int_0^\infty s\,v_\tau(s)h(s)\,ds$, by standard
concentration properties of $v_\tau$. Hence,

$$\mathbb{P}\{f(g_\tau) = 1\} = \int_0^\infty v_\tau(s)\,h(s)\,ds \ge \frac{1}{10Bd}\int_0^\infty s\,v_\tau(s)\,h(s)\,ds \ge \frac{1/4B}{10Bd} = \frac{1}{40B^2 d}.$$



By the previous claim, it suffices to prove Equation 9. To do so we will apply Lemma 3.3. 
The lemma in fact directly implies the claim, if we can show that h satisfies the properties 
required in Lemma 3.3. It is easily verified that these properties coincide with the soundness 
assumption on $f$. Thus,

$$\int_d^{Bd}\int_0^\infty (s - \tau)\,v_\tau(s)\,h(s)\,ds\,d\tau \ge \frac{d}{4}.$$



This concludes the proof of Lemma 4.5. ■ 

The following corollary is a direct consequence of Lemma 4.5 that states that there is one 
direction in the subspace that has increased variance. 

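Corollary 4.7. Let $\mathcal{G}(U)$ satisfy the assumptions of Lemma 4.5. Then, there is $\tau \in [d, Bd]$ and a
vector $u \in U$, satisfying,

1. $\mathbb{E}\bigl[\langle u, g_\tau\rangle^2 \mid f(g_\tau) = 1\bigr] \ge \mathbb{E}\bigl[\langle u, g_\tau\rangle^2\bigr] + \frac{1}{4Bd}$,

2. $\mathbb{P}\{f(g_\tau) = 1\} \ge \frac{1}{40B^2 d}$.

Proof. Pick an arbitrary orthonormal basis $u_1,\dots,u_d$ of $U$. Since $\|P_U g_\tau\|^2 = \sum_{i=1}^d \langle u_i, g_\tau\rangle^2$, one
of the basis vectors must satisfy the conclusion of the lemma by an averaging argument. ■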
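
The averaging step can be spelled out as follows. This is only a sketch: the symbol $\eta$ is introduced here for illustration and stands for whatever additive gap Lemma 4.5 guarantees for $\|P_U g_\tau\|^2$. Expanding the squared norm over the orthonormal basis and using linearity of (conditional) expectation gives

```latex
\sum_{i=1}^{d}\Bigl(\mathbb{E}\bigl[\langle u_i,g_\tau\rangle^2 \,\big|\, f(g_\tau)=1\bigr]
  - \mathbb{E}\bigl[\langle u_i,g_\tau\rangle^2\bigr]\Bigr)
= \mathbb{E}\bigl[\|P_U g_\tau\|^2 \,\big|\, f(g_\tau)=1\bigr]
  - \mathbb{E}\bigl[\|P_U g_\tau\|^2\bigr] \;\ge\; \eta .
```

Since the $d$ summands add up to at least $\eta$, at least one index $i$ contributes at least $\eta/d$, and $u = u_i$ is the desired direction. The lower bound on $\mathbb{P}\{f(g_\tau)=1\}$ does not involve $u$ and carries over unchanged.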



4.1 Noisy orthogonal complements 

In this section we extend the conditional expectation lemma to a family of distributions that 
will be important to us later on. We will fix an $r$-dimensional subspace $A \subseteq \mathbb{R}^n$. The family of
distributions we will define next isn't subspace Gaussian on $A$, but rather subspace Gaussian
on $A \cap V^\perp$, where $V \subseteq A$ is a linear subspace of $A$ of dimension $\dim(V) \le r - 1$.

A distribution in this family is given by a subspace $V$ and a variance $\sigma^2$. Intuitively, the
distribution corresponds to a Gaussian distribution on the subspace $V^\perp$ of variance $\sigma^2$ plus a
small Gaussian supported on all of $\mathbb{R}^n$ of constant variance independent of $\sigma^2$. The formal
definition is given next.

Definition 4.8. Let $\sigma > 0$. Given a subspace $V \subseteq A$ of dimension $t \le r - 1$, let $d = r - t$.
We define the distribution $G(V^\perp, \sigma^2)$ as the distribution obtained from sampling
$g_1 \sim N(0,\sigma^2)^n$ and $g_2 \sim N(0, 1/4)^n$ independently and outputting $g = P_{V^\perp} g_1 + g_2$.

Further we define the family of distributions $\mathcal{G}(A \cap V^\perp) = \{g_\tau\}$ by letting $g_\tau = P_A g$ where
$g \sim G(V^\perp, \tau/d - 1/4)$ if $\tau/d > 1/4$, and otherwise we put $g_\tau = P_A g$ where $g \sim N(0,\tau/d)^n$.

The next lemma confirms that $\mathcal{G}(A \cap V^\perp)$ is subspace Gaussian.

Lemma 4.9. $\mathcal{G}(A \cap V^\perp)$ is a subspace Gaussian family.

Proof. Let $U = A \cap V^\perp$. For $\tau/d \le 1/4$, it is clear that $\mathbb{E}\|P_U g_\tau\|^2 = d\tau/d = \tau$ as is required and
$\|P_{U^\perp} g_\tau\| = 0$. For $\tau/d > 1/4$, recall that $g = P_{V^\perp} g_1 + g_2$. Hence, inside $U$, $P_U g$ is distributed like a
spherical Gaussian with variance $\tau/d$ in each direction. In particular,

$$\mathbb{E}\|P_U g_\tau\|^2 = \mathbb{E}\|P_U(g_1 + g_2)\|^2 = d\tau/d = \tau.$$

On the other hand, $P_{U^\perp} g$ only depends on $g_2$ and is hence independent of $\tau$. This shows the
second property of subspace Gaussian. ■
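
To make Definition 4.8 concrete, here is a minimal sketch of drawing one sample from $G(V^\perp, \sigma^2)$. It assumes $V$ is handed over as an orthonormal basis; the function names `project_out` and `sample_noisy_complement` are made up for this illustration and do not come from the paper.

```python
import random


def project_out(x, basis):
    """Return x minus its projection onto span(basis).

    Assumes `basis` is an orthonormal list of vectors (our assumption),
    so the result is P_{V^perp} x for V = span(basis).
    """
    y = list(x)
    for b in basis:
        coeff = sum(xi * bi for xi, bi in zip(x, b))
        y = [yi - coeff * bi for yi, bi in zip(y, b)]
    return y


def sample_noisy_complement(basis_V, n, sigma2, rng):
    """Draw g = P_{V^perp} g1 + g2 with g1 ~ N(0, sigma2)^n, g2 ~ N(0, 1/4)^n."""
    g1 = [rng.gauss(0.0, sigma2 ** 0.5) for _ in range(n)]
    g2 = [rng.gauss(0.0, 0.5) for _ in range(n)]  # std 1/2, i.e. variance 1/4
    projected = project_out(g1, basis_V)
    return [a + b for a, b in zip(projected, g2)]
```

Along any direction inside $V$ only the noise term $g_2$ survives, so the variance there is $1/4$ regardless of $\sigma^2$, while directions in $V^\perp$ have variance $\sigma^2 + 1/4$.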

The next definition captures the correctness requirement on $f$ for inputs drawn from the
distribution $G(V^\perp, \sigma^2)$.

Definition 4.10 (Correctness). We say that a function $f\colon A \to \{0,1\}$ is $(\varepsilon,B)$-correct on $V^\perp$
with $d = \dim(V^\perp \cap A)$ if:

1. for all $\sigma^2 \in [B/2, 2B]$ and $g \sim G(V^\perp, \sigma^2)$ we have $\mathbb{P}\{f(g) = 1\} \ge 1 - \varepsilon$,

2. for all $\sigma^2 \in [0,2]$ and $g \sim G(V^\perp, \sigma^2)$ we have $\mathbb{P}\{f(g) = 1\} \le \varepsilon$.

We say that $f$ is $B$-correct on $V^\perp$ if it is $(\varepsilon, B)$-correct for some $\varepsilon \le 1/10(Bd)^2$.

We will now relate the correctness definition to our earlier soundness definition. 
Lemma 4.11. If $f$ is $B$-correct on $V^\perp$, then $f$ is $B$-sound for $\mathcal{G}(A \cap V^\perp)$.

Proof. We will prove the claim in its contrapositive. Indeed suppose that $f$ is not $B$-sound
for $\mathcal{G}(A \cap V^\perp)$. This means that one of the two requirements in Definition 4.4 is
not satisfied. Suppose it is the first one. In this case we know that for $I = [Bd/2, 2Bd]$
and $h(s) = \mathbb{P}\{f(g_\tau) = 1 \mid \|g_\tau\|^2 = s\}$, we have $\mathbb{E}_{s\in I}(1 - h(s)) > 1/2(Bd)^2$. Suppose we sample
$g \sim G(V^\perp, \sigma^2)$ where $\sigma^2$ is chosen uniformly at random from $[B/2, 2B]$. We claim that the
distribution of $\|g\|^2$ is pointwise within a factor 5 of the uniform distribution inside the
interval $[B/2, 2B]$. Hence, $\mathbb{E}\bigl[1 - h(\|g\|^2)\bigr] > 1/10(Bd)^2$. This violates the first condition of correctness.

The case where the second requirement of Definition 4.4 is violated follows from an
analogous argument. ■

Below we state a variant of the conditional expectation lemma for distributions of the 
above form. Moreover, we will remove the requirement that $t \le r - d$ and obtain a result
that applies to any $t \le \dim(A)$.

Lemma 4.12. Let $A \subseteq \mathbb{R}^n$ be a subspace of dimension $\dim(A) = r \le n - d_0$ for some sufficiently
large constant $d_0$. Let $V \subseteq A$ be a subspace of $A$ of dimension $t < r$. Suppose that $f\colon A \to \{0,1\}$ is
$(1/10(dB)^2, B)$-correct on $V^\perp$. Then, there exists a scalar $\sigma^2 \in [3/4, B]$, and a vector $u \in A \cap V^\perp$
such that for $g \sim G(V^\perp, \sigma^2)$ we have for $d = \max\{r - t, d_0\}$:

1. $\mathbb{E}\bigl[\langle u, g\rangle^2 \mid f(g) = 1\bigr] \ge \mathbb{E}\bigl[\langle u, g\rangle^2\bigr] + \frac{1}{4Bd}$,

2. $\mathbb{P}\{f(g) = 1\} \ge \frac{1}{40B^2 d}$.

Proof. Before we proceed we would like to ensure that $\dim(A \cap V^\perp)$ is at least a sufficiently
large constant $d_0$. This can be ensured without loss of generality by considering instead a
subspace $A' \supseteq A$ of dimension $r + d_0$ obtained by extending $A$ arbitrarily to $r + d_0$ dimensions.
This can be done since $n \ge r + d_0$. Define the function $f'(x) = f(P_A x)$. Note that $f'(x) = f(x)$ on
all $x \in \mathbb{R}^n$. Hence, $f$ is still $(1/10(dB)^2, B)$-correct on $V^\perp$. Moreover, now $\dim(V^\perp \cap A') \ge d_0$.
Hence, by Lemma 4.11, we have that $f$ is sound for the subspace Gaussian family $\mathcal{G}(A' \cap V^\perp)$.
Let $U = A' \cap V^\perp$. We can apply Corollary 4.7 to $\mathcal{G}(U)$ to conclude that there is $\tau \in [d,Bd]$ and
$u \in U$ such that

$$\mathbb{E}\bigl[\langle u, g_\tau\rangle^2 \mid f'(g_\tau) = 1\bigr] \ge \mathbb{E}\bigl[\langle u, g_\tau\rangle^2\bigr] + \frac{1}{4Bd}$$

and $\mathbb{P}\{f'(g_\tau) = 1\} \ge \frac{1}{40B^2 d}$. By definition of $f'$ the condition $f'(g_\tau) = 1$ is equivalent to $f(g_\tau) = 1$.
The condition $f(g_\tau) = 1$ does not affect any vector that is orthogonal to $A$. Hence we may
assume that $u \in A \cap V^\perp$. Also note that $g_\tau = P_{A'} g$ for some $g \sim G(V^\perp, \sigma^2)$ with $\sigma^2 \in [3/4, B]$.
Moreover, $f(g) = f'(g_\tau)$, and also $\langle u, g_\tau\rangle = \langle u, g\rangle$ since $u \in U$. Hence, we have

$$\mathbb{E}\bigl[\langle u, g\rangle^2 \mid f(g) = 1\bigr] \ge \mathbb{E}\bigl[\langle u, g\rangle^2\bigr] + \frac{1}{4Bd}$$

with $\mathbb{P}\{f(g) = 1\} \ge \frac{1}{40B^2 d}$. This is what we wanted to show. ■



4.2 Distance between subspaces 

Our goal is to relate distributions of the form $G(V^\perp, \sigma^2)$ to $G(W^\perp, \sigma^2)$ where $V$ and $W$ are
subspaces. For this purpose, we consider the following distance function $d(V, W)$ between
two subspaces $V, W \subseteq \mathbb{R}^n$:

$$d(V, W) = \|P_V - P_W\|_2 = \sup_{v \in \mathbb{R}^n \setminus \{0\}} \frac{\|P_V v - P_W v\|}{\|v\|}. \qquad (10)$$

We will show that if $V$ and $W$ are close in this distance measure, then the two distributions
$G(V^\perp, \sigma^2)$ and $G(W^\perp, \sigma^2)$ are statistically close. Recall that we denote the statistical distance
between two distributions $X, Y$ by $\|X - Y\|_{tv}$. We need the following well-known fact.

Fact 4.13. Let $v \in \mathbb{R}^n$. Then,

$$\|N(0,\sigma^2)^n - N(v,\sigma^2)^n\|_{tv} \le \frac{\|v\|}{\sigma}.$$

Using this fact we can express the statistical distance between $G(V^\perp, \sigma^2)$ and $G(W^\perp, \sigma^2)$
for two subspaces $V, W$ in terms of the distance $d(V, W)$.

Lemma 4.14. For every $\sigma^2 \in (0,B]$, we have

$$\|G(V^\perp, \sigma^2) - G(W^\perp, \sigma^2)\|_{tv} \le 20\sqrt{Bn\log(Bn)}\cdot d(V, W) + \frac{1}{(Bn)^5}.$$

Proof. Sample $g_1 \sim N(0, \sigma^2)^n$ and $g_2, g_2' \sim N(0, 1/4)^n$ independently. Let us denote
$x = P_{V^\perp} g_1 + g_2$ and $y = P_{W^\perp} g_1 + g_2'$. Note that $x$ is distributed like a random draw from $G(V^\perp, \sigma^2)$
and $y$ like a draw from $G(W^\perp, \sigma^2)$. However, we introduced a dependence through $g_1$. Note
that it is sufficient to bound the statistical distance of these coupled variables. On the one
hand,

$$\|P_{V^\perp} g_1 - P_{W^\perp} g_1\| = \|P_V g_1 - P_W g_1\| \le \|g_1\| \cdot d(V, W).$$

On the other hand, by Gaussian concentration bounds,

$$\mathbb{P}\bigl\{\|g_1\| > 10\sqrt{Bn\log(Bn)}\bigr\} \le \frac{1}{(Bn)^5}.$$

Condition on $\|g_1\| \le 10\sqrt{Bn\log(Bn)}$. Under this condition, for every possible value
$u = P_{V^\perp} g_1 - P_{W^\perp} g_1$, we have

$$\|N(u, 1/4)^n - N(0, 1/4)^n\|_{tv} \le 2\|u\| \quad \text{(by Fact 4.13)}$$
$$\le 2\|g_1\| \cdot d(V, W) \le 20\sqrt{Bn\log(Bn)} \cdot d(V, W).$$

Noting that $u + N(0, 1/4)^n = N(u, 1/4)^n$, it follows

$$\bigl\|\bigl(P_{V^\perp} g_1 + N(0, 1/4)^n\bigr) - \bigl(P_{W^\perp} g_1 + N(0, 1/4)^n\bigr)\bigr\|_{tv} = \|N(u, 1/4)^n - N(0, 1/4)^n\|_{tv} \le 20\sqrt{Bn\log(Bn)} \cdot d(V, W).$$

Finally, since the condition $\|g_1\| \le 10\sqrt{Bn\log(Bn)}$ has probability $1 - 1/(Bn)^5$, removing it can
only increase the statistical distance of the two variables by additive $1/(Bn)^5$. ■

5 An Adaptive Reconstruction Attack 

We next state and prove our main theorem. It shows that no function $f\colon \mathbb{R}^n \to \{0,1\}$ that
depends only on a lower dimensional subspace can correctly predict the $\ell_2^2$-norm up to a
factor $B$ on a polynomial number of adaptively chosen inputs. Here, $B$ can be any factor and
the complexity of our attack will depend on $B$ and the dimension of the subspace. We will
in fact show a more powerful distributional result. This result states that no such function
can predict the $\ell_2^2$-norm on a rather natural sequence of distributions even if we allow the
function to err on each distribution with inverse polynomial probability. This distributional 
strengthening will be useful in our application to compressed sensing later on. The next 
definition formalizes the way in which a linear sketch will fail under our attack. 

Definition 5.1 (Failure certificate). Let $B \ge 8$ and let $f\colon \mathbb{R}^n \to \{0,1\}$. We say that a pair $(V, \sigma^2)$
is a $d$-dimensional failure certificate for $f$ if $V \subseteq \mathbb{R}^n$ is a $d$-dimensional subspace and $\sigma^2 \in [0, 2B]$
such that for some constant $C > 0$, we have $n \ge d + 10C\log(Bn)$ and moreover:

- Either $\sigma^2 \in [B/2, 50B]$ and $\mathbb{P}_{g \sim G(V^\perp,\sigma^2)}\{f(g) = 1\} \le 1 - (Bn)^{-C}$,

- or $\sigma^2 \le 2$ and $\mathbb{P}_{g \sim G(V^\perp,\sigma^2)}\{f(g) = 1\} \ge n^{-C}$.

The motivation for the previous definition is given by the next simple fact showing that
a failure certificate always gives rise to a distribution over which $f$ does not decide the
GapNorm problem up to a factor $\Omega(B)$ on a polynomial number of queries. We note that
in Section 5.3 we strengthen this concept to give a distribution where $f$ errs with constant
probability.

Fact 5.2. Given a $d$-dimensional failure certificate for $f$, we can find with $\mathrm{poly}(Bn)$ non-adaptive
queries with probability $2/3$ an input $x$ such that either $\|x\|^2 \ge B(n - d)/3$ and $f(x) = 0$ or
$\|x\|^2 \le 3(n - d)$ and $f(x) = 1$.

Proof. Sample $O((Bn)^C)$ queries from $G(V^\perp, \sigma^2)$. Suppose $\sigma^2 \le 2$. Since $n - d$ is sufficiently
large compared to $d$, by a union bound and Gaussian concentration, we have that with high
probability simultaneously for all queries $x$, $\|x\|^2 \le 3(n - d)$. On the other hand, with high
probability $f$ outputs 1 on one of the queries. The case where $\sigma^2 \ge B/2$ follows with the
analogous argument. ■

Our next theorem shows that we can always find a failure certificate with a polynomial 
number of queries. 

Theorem 5.3 (Main). Let $B \ge 8$. Let $A \subseteq \mathbb{R}^n$ be an $r$-dimensional subspace of $\mathbb{R}^n$ such that
$n \ge r + 90\log(Br)$. Assume that $B \le \mathrm{poly}(n)$. Let $f\colon \mathbb{R}^n \to \{0,1\}$ satisfying $f(x) = f(P_A x)$ for all
$x \in \mathbb{R}^n$. Then, there is an algorithm that given only oracle access to $f$ finds with probability $9/10$ a
failure certificate for $f$. The time and query complexity of the algorithm is bounded by $\mathrm{poly}(B, r)$.
Moreover, all queries that the algorithm makes are sampled from $G(V^\perp, \sigma^2)$ for some $V \subseteq \mathbb{R}^n$ and
$\sigma^2 \in (0,B]$.

We next describe the algorithm promised in Theorem 5.3 in Figure 2. 
5.1 Proof of Theorem 5.3 

Proof. We may assume without loss of generality that $n = r + 90\log(Br)$ by working with the
first $r + 90\log(Br)$ coordinates of $\mathbb{R}^n$. This ensures that a polynomial dependence on $n$ in our
algorithm is also a polynomial dependence on $r$.

For each $1 \le t \le r + 1$, let $W_t \subseteq A$ be the closest $(t-1)$-dimensional subspace to $V_t$ that is
contained in $A$. Formally $W_t$ satisfies

$$d(V_t, W_t) = \min\{d(V_t, W)\colon \dim(W) = t - 1,\ W \subseteq A\}. \qquad (11)$$

Note that here we identify $V_t$ with the subspace that is spanned by the vectors contained in
$V_t$. We will maintain (with high probability) that the following invariant is true during the
attack:

Invariant at step $t$:

$$\dim(V_t) = t - 1 \quad\text{and}\quad d(V_t, W_t) \le \frac{t}{20(Bn)^{3.5}\log(Bn)^{2.5}}. \qquad (12)$$

Note that the invariant holds vacuously at step 1, since $V_1 = \{0\} \subseteq A$. Informally speaking, our
goal is to show that either the algorithm terminates with a failure certificate or the invariant
continues to hold. Note that whenever the invariant holds in a step $t$, we must have

$$d(V_t, W_t) \le \frac{1}{20B^{3.5}n^{2.5}\log(Bn)^{2.5}}.$$

Hence, Lemma 4.14 shows that for every $\sigma^2 \in (0, B]$,

$$\|G(V_t^\perp, \sigma^2) - G(W_t^\perp, \sigma^2)\|_{tv} \le 20\sqrt{Bn\log(Bn)} \cdot d(V_t, W_t) + \frac{1}{(Bn)^5} \le \frac{1}{B^3 n^2 \log(Bn)^2}. \qquad (13)$$

This observation leads to the following useful lemma.

Input: Oracle $\mathcal{A}$ providing access to a function $f\colon \mathbb{R}^n \to \{0,1\}$, parameter $B \ge 4$.

Attack: Let $V_1 = \{0\}$, $m = O(B^{13} n^{11} \log^{15}(n))$, and $S = [3/4, B] \cap \varepsilon\mathbb{Z}$ where
$\varepsilon = 1/20(Bn)^2\log(Bn)$.

For $t = 1$ to $t = r + 1$:

1. For each $\sigma^2 \in S$:

(a) Sample $g_1,\dots,g_m \sim G(V_t^\perp, \sigma^2)$. Query $\mathcal{A}$ on each $g_i$. Let $a_i = \mathcal{A}(g_i)$.

(b) Let $s(t,\sigma^2) = \frac{1}{m}\sum_{i=1}^m a_i$ denote the fraction of samples that are positively labeled.

- If either $\sigma^2 \ge B/2$ and $s(t,\sigma^2) \le 1 - \varepsilon$, or $\sigma^2 \le 2$ and $s(t,\sigma^2) \ge \varepsilon$, then terminate
and output $(V_t, \sigma^2)$ as a purported failure certificate.

- Else let $g'_1,\dots,g'_{m'}$ be the vectors such that $\mathcal{A}(g'_i) = 1$ for all $i \in [m']$.

(c) If $m' < m/100B^2 n$, proceed to the next $\sigma^2$. Else, compute $v_\sigma \in \mathbb{R}^n$ as the maximizer
of the objective function $z(v) = \frac{1}{m'}\sum_{i=1}^{m'} \langle v, g'_i\rangle^2$.

2. Let $v^*$ denote the first vector $v_\sigma$ that achieved objective function $z(v_\sigma) \ge (\sigma^2 + 1/4) + \Delta$ where
$\Delta = 1/7Br$.

- If no such $v_\sigma$ was found, let $V_{t+1} = V_t$ and proceed to the next round.

- Else let $v_t = \frac{v^* - P_{V_t} v^*}{\|v^* - P_{V_t} v^*\|}$ and put $V_{t+1} = V_t \cup \{v_t\}$.

Figure 2: Reconstruction Attack on Linear Sketches. The algorithm iteratively builds a subspace $V_t$
that is approximately contained in the unknown subspace $A$. In each round the algorithm queries $\mathcal{A}$
on a sequence of queries chosen from the orthogonal complement of $V_t$. As the dimension of $V_t$ grows
larger, the oracle must make a mistake.



Lemma 5.4. Assume that the invariant holds at step $t$. Then, if $f$ is $(\alpha,B)$-correct on $V_t^\perp$, then $f$
is $(\alpha + \varepsilon, B)$-correct on $W_t^\perp$.

Proof. Equation 14 implies that for every $\sigma^2 \in (0,B]$, the statistical distance between $G(V_t^\perp, \sigma^2)$
and $G(W_t^\perp, \sigma^2)$ is at most $\varepsilon$. Hence, the correctness conditions from Definition 4.10 hold up
to an $\varepsilon$-loss in the probabilities. ■

Let $E$ denote the event that the empirical estimate $s(t,\sigma^2)$ is accurate at all steps of the
algorithm. Formally:

$$\forall t\ \forall \sigma^2 \in S\colon \quad \Bigl| s(t,\sigma^2) - \mathbb{P}_{G(V_t^\perp,\sigma^2)}\{f(g) = 1\} \Bigr| \le \varepsilon.$$

Lemma 5.5. $\mathbb{P}\{E\} \ge 1 - \exp(-n)$.

Proof. The claim follows from a standard application of the Chernoff bound, since we chose
the number of samples $m \gg (Bn/\varepsilon)^2$. ■
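
As a rough sanity check on how many samples such an estimate needs, the sketch below uses the two-sided Hoeffding inequality, $\mathbb{P}\{|s - p| \ge \varepsilon\} \le 2\exp(-2m\varepsilon^2)$, in place of the Chernoff bound cited in the proof (a substitution made here for simplicity). The function name `hoeffding_samples` is hypothetical.

```python
import math


def hoeffding_samples(eps, fail_prob):
    """Smallest m with 2 * exp(-2 * m * eps**2) <= fail_prob.

    Guarantees that the empirical mean of m independent {0,1} labels is
    within eps of its expectation, except with probability fail_prob.
    """
    return math.ceil(math.log(2.0 / fail_prob) / (2.0 * eps * eps))
```

To cover every pair $(t, \sigma^2)$ at once one would divide the target failure probability by the number of pairs before calling the function (a union bound). The resulting count grows like $1/\varepsilon^2$ times a logarithmic factor, which sits comfortably below the $m$ chosen in Figure 2.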

Lemma 5.6. Under the condition that $E$ occurs, the following is true: If the algorithm terminates
in round $t$ and outputs $G(V_t^\perp, \sigma^2)$, then $G(V_t^\perp, \sigma^2)$ is a failure certificate for $f$. Moreover, if the
algorithm does not terminate in round $t$ and the invariant holds in round $t$, then $f$ is $B$-correct on
$W_t^\perp$.

Proof. The first claim follows directly from the definition of a failure certificate and the
condition that the empirical error given by $s(t,\sigma^2)$ is $\varepsilon$-close to the actual error.

The second claim follows from Lemma 5.4. Indeed by the condition $E$ and the assumption
that the algorithm did not terminate, we must have that $f$ is $(2\varepsilon, B)$-correct on $V_t^\perp$. By
Lemma 5.4, this implies that $f$ is $(3\varepsilon, B)$-correct on $W_t^\perp$. Note that $3\varepsilon \le 1/10(Bn)^2$ and hence
$f$ is correct on $W_t^\perp$. ■

The next lemma is crucial as it shows that the invariant continues to hold with high
probability assuming that $f$ continues to be correct.

Lemma 5.7 (Progress). Let $t \le r$. Assume that the invariant holds in round $t$ and that $f$ is
$B$-correct on $W_t^\perp$. Then, with probability $1 - 1/n^2$ the invariant holds in round $t + 1$.

We will carry out the proof of Lemma 5.7 in Section 5.2. But before we do so we will
conclude the proof of Theorem 5.3 assuming that the previous lemma holds. In order to
do so, we argue that if we reach the final round and the invariant holds, we have effectively
reconstructed all of $A$. Hence, it must be the case that $f$ is no longer correct on $W_t^\perp$ as shown
next.

Lemma 5.8. Suppose that the invariant holds for $t = r + 1$. Then, $f$ is not $B$-correct on $W_t^\perp$.

Proof. Since $t = r + 1$ and the invariant holds, we have $\dim(V_t) = \dim(W_t) = r$. On the other
hand $W_t \subseteq A$ and $\dim(A) = r$. Hence, $W_t = A$. Therefore, the function $f$ cannot distinguish
between samples from $G(W_t^\perp, 2)$ and samples from $G(W_t^\perp, B)$. Thus, $f$ must make a mistake
with constant probability on one of the distributions. ■

Condition on the event that $E$ occurs. Since $E$ has probability $1 - \exp(-n)$, this affects the
success probability of our algorithm only by a negligible amount. Under this condition, if
the algorithm terminates in a round $t$ with $t \le r$, then by Lemma 5.6, the algorithm actually
outputs a failure certificate for $f$. On the other hand, suppose that we do not terminate in
any of the rounds $t \le r$. By the second part of Lemma 5.6, this means that in each round $t$
it must be the case that $f$ is correct on $W_t^\perp$ assuming that the invariant holds at step $t$. In
this case we can apply Lemma 5.7 to argue that the invariant continues to hold in round
$t + 1$. Since the invariant holds in step 1, it follows that if the algorithm does not terminate
prematurely, then with probability $(1 - 1/n^2)^r \ge 1 - 1/n$ the invariant still holds at step $r + 1$.
But in this case, $f$ is not correct on $W_{r+1}^\perp$ by Lemma 5.8 and hence by Lemma 5.6 we output
a failure certificate with probability $1 - \exp(-n)$. Combining the two possible cases, it follows
that the algorithm successfully finds a failure certificate for $f$ with probability $1 - 2/n$. This
is what is required by Theorem 5.3.
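
The estimate $(1 - 1/n^2)^r \ge 1 - 1/n$ used in the paragraph above is an instance of Bernoulli's inequality combined with $r \le n$ (which holds because $A$ is an $r$-dimensional subspace of $\mathbb{R}^n$). Spelled out:

```latex
\Bigl(1-\frac{1}{n^{2}}\Bigr)^{r} \;\ge\; 1-\frac{r}{n^{2}} \;\ge\; 1-\frac{n}{n^{2}} \;=\; 1-\frac{1}{n}.
```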

It therefore only remains to argue about query complexity and running time. The query
complexity is polynomially bounded in $n$ and hence also in $r$ since we may assume that
$n \le O(r)$ as previously argued. Computationally, the only non-trivial step is finding the vector
$v_\sigma$ that maximizes $z(v) = \frac{1}{m'}\sum_{i=1}^{m'} \langle v, g'_i\rangle^2$. We claim that this vector can be found efficiently
using singular vector computation. Indeed, let $G$ be the $m' \times n$ matrix that has $g'_1,\dots,g'_{m'}$
as its rows. The top singular vector $v$ of $G$, by definition, maximizes $\|Gv\|^2 = \sum_{i=1}^{m'} \langle g'_i, v\rangle^2$.
Hence, it must also maximize $z(v)$. This shows that the attack can be implemented in
time polynomial in $r$. This concludes the proof of Theorem 5.3. ■
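
The maximization of $z(v)$ over unit vectors can be illustrated with plain power iteration on $G^\top G$. This is a minimal sketch rather than the paper's procedure: the paper only says that a top singular vector computation suffices, and in practice one would call a library SVD routine. The names `top_direction` and `objective` are invented for this example.

```python
import random


def top_direction(rows, iters=200, seed=0):
    """Approximate a unit vector v maximizing sum_i <v, g_i>^2.

    Runs power iteration on G^T G, where the g_i are the rows of G.
    """
    n = len(rows[0])
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = sum(x * x for x in v) ** 0.5
    v = [x / norm for x in v]
    for _ in range(iters):
        # w = G^T (G v)
        gv = [sum(gi * vi for gi, vi in zip(g, v)) for g in rows]
        w = [sum(gv[i] * rows[i][j] for i in range(len(rows))) for j in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:
            break  # G v vanished; keep the current unit vector
        v = [x / norm for x in w]
    return v


def objective(rows, v):
    """Average squared inner product, i.e. the normalized objective z(v)."""
    return sum(sum(gi * vi for gi, vi in zip(g, v)) ** 2 for g in rows) / len(rows)
```

Power iteration converges quickly only when the top singular value is well separated from the rest, which is the regime the attack cares about: a direction whose conditional variance exceeds the baseline by a noticeable margin.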



5.2 Proof of the Progress Lemma (Lemma 5.7) 



Let $t \le r$. Assume that the invariant holds in round $t$. Further assume that $f$ is $B$-correct on
$W_t^\perp$. Recall that under these assumptions, by Equation 14, for every $\sigma^2 \in (0, B]$,

$$\delta \stackrel{\mathrm{def}}{=} \|G(V_t^\perp, \sigma^2) - G(W_t^\perp, \sigma^2)\|_{tv} \le \frac{1}{B^3 n^2 \log(Bn)^2}. \qquad (14)$$



Our goal is to show that with probability 1 - l/n 2 , the invariant holds in round t + 1. To prove 
this claim we will invoke the conditional expectation lemma (Lemma 4.12) in the analysis of 
our algorithm. 

Lemma 5.9. Assume that $f$ is correct on $W_t^\perp$. There exists a $\tilde\sigma^2 \in S$, $\Delta \ge \frac{1}{7Br}$ and $u \in V_t^\perp \cap A$ such
that for $g \sim G(V_t^\perp, \tilde\sigma^2)$, we have $\mathbb{P}\{f(g) = 1\} \ge 1/60B^2 r$ and

$$\mathbb{E}\bigl[\langle u, g\rangle^2 \mid f(g) = 1\bigr] \ge \mathbb{E}\bigl[\langle u, g\rangle^2\bigr] + \Delta.$$

Proof. By our assumption, $W_t$ satisfies the assumptions of the conditional expectation
lemma (Lemma 4.12) so that there exists $u \in U = W_t^\perp \cap A$ and $\sigma^2 \in [3/4, B]$, such that

$$\mathbb{E}_{G(W_t^\perp,\sigma^2)}\bigl[\langle u, g\rangle^2 \mid f(g) = 1\bigr] \ge \mathbb{E}\bigl[\langle u, g\rangle^2\bigr] + \frac{1}{4Br}$$

and $\mathbb{P}\{f(g) = 1\} \ge \frac{1}{40B^2 r}$. On the other hand, by Equation 14, we know that $G(W_t^\perp, \sigma^2)$ is
$\delta$-statistically close to $G(V_t^\perp, \sigma^2)$ and that $\delta = o(1/B^2 n)$. We claim that therefore

$$\mathbb{E}_{G(V_t^\perp \cap A,\sigma^2)}\bigl[\langle u, g\rangle^2 \mid f(g) = 1\bigr] \ge \mathbb{E}\bigl[\langle u, g\rangle^2\bigr] + \frac{1}{6Br} \qquad (15)$$

and $\mathbb{P}\{f(g) = 1\} \ge \frac{1}{50B^2 r}$. The latter statement is immediate because $f(g) \in \{0, 1\}$ and hence
$\mathbb{P}\{f(g) = 1\}$ can differ by at most $\delta = o(1/B^3 r\log(rB))$ between the two distributions. This
further implies that the statistical distance between the two distributions under the condition
that $f(g) = 1$ can only increase by a factor of $50B^2 r$. Formally, for every distinguishing
function $R\colon \mathbb{R}^n \to [0,D]$, we have

$$\Bigl|\mathbb{E}_{G(W_t^\perp,\sigma^2)}\bigl[R(g) \mid f(g) = 1\bigr] - \mathbb{E}_{G(V_t^\perp,\sigma^2)}\bigl[R(g) \mid f(g) = 1\bigr]\Bigr| \le 50B^2 r D\delta. \qquad (16)$$

On the other hand, $\mathbb{P}\{\langle u, g\rangle^2 > 10\sigma^2 c\} \le \exp(-c)$. Hence, we can truncate $\langle u, g\rangle^2$ at
$D = 10B\log(rB)$ without affecting either expectation by more than $o(1/Br)$. Hence, by Equation 16
and the fact that $O(B^3 r\log(rB))\delta = o(1/Br)$, we have

$$\Bigl|\mathbb{E}_{G(W_t^\perp,\sigma^2)}\bigl[\langle u, g\rangle^2 \mid f(g) = 1\bigr] - \mathbb{E}_{G(V_t^\perp \cap A,\sigma^2)}\bigl[\langle u, g\rangle^2 \mid f(g) = 1\bigr]\Bigr| \le o\Bigl(\frac{1}{Br}\Bigr). \qquad (17)$$

A similar argument shows that if we change $\sigma^2$ by only $o(1/B^2 n^2)$ additively, Equation 17
continues to hold up to an insignificant loss in the parameters. Hence, there exists $\sigma^2$ in our
discretization for which this claim is true. Finally since $u \in W_t^\perp \cap A$ and the invariant holds
for $(V_t, W_t)$, we have that $\|P_{V_t^\perp} u\| \ge 1 - 1/B^2 n^2$. Hence, the conclusion of the lemma also holds
for some $u \in V_t^\perp \cap A$ up to an additive $o(1/Bn)$ loss in the expectation. ■

Informally speaking, the previous lemma suggests that for $\tilde\sigma^2 \in S$, the vector $v_{\tilde\sigma}$ has
exceptionally high objective value. Moreover, we need to show that any vector that has high
objective value must be very close to the subspace $V_t^\perp \cap A$. This is formally argued next. The
result follows from analyzing the top singular vector of the biased Gaussian matrix obtained 
by arranging all positively labeled examples into a matrix. This analysis uses standard 
techniques which we defer to Section 6 but rely on in the next lemma. 

Lemma 5.10. With probability $1 - \exp(-n)$, the vector $v^*$ found by the algorithm in step $t$ satisfies

$$\|P_{V_t^\perp \cap A}\, v^*\|^2 \ge 1 - \frac{1}{20(Bn)^{3.5}\log^4(Bn)}. \qquad (18)$$

Proof. Let $z^* = \max_{\sigma^2 \in S} z(v_\sigma)$ denote the maximum objective value achieved by any $\sigma$ in
round $t$ of the algorithm. We will first lower bound $z^*$ using the information we have about
$\tilde\sigma$ from the previous lemma. To this end, we would next like to apply Lemma 6.3 to the
conditional distribution of $g \sim G(V_t^\perp, \tilde\sigma^2)$ conditioned on the event that $f(g) = 1$. Note that
$g'_1,\dots,g'_{m'}$ are uniformly sampled from this distribution. Furthermore, the probability that
$\mathcal{A}$ outputs 1 on each sample is at least $p \ge \Omega(1/B^2 n)$. This can be used to show that by a
Chernoff bound, with probability $1 - \exp(-n)$, we have that

$$m' \ge \frac{pm}{2} \ge \Omega\bigl(B^{11} n^{10} \log^{15}(n)\bigr).$$

We will apply Lemma 6.3 with the following setting of $\gamma$:

$$\gamma = \frac{1}{(Bn)^{3.5}\log^4(Bn)}.$$

We need to verify the following conditions of Lemma 6.3. In doing so let $V = V_t^\perp \cap A$ and
$W = V^\perp = V_t + A^\perp$. Further, let $\tau = \sigma^2 + 1/4$ and $\Delta$ be the parameter from Lemma 5.9.

1. Any unit vector $w \in W$ can be written as $\alpha v + \beta w'$, where $\alpha^2 + \beta^2 = 1$, and $v, w'$ are unit
vectors with $v \in V_t$ and $w' \in A^\perp$ orthogonal to $v$. But $\mathbb{E}\langle v, g\rangle^2 \le 1/4$. Moreover, the
condition $f(g) = 1$ does not bias the distribution along directions inside $A^\perp$. Hence,
$\mathbb{E}\langle w', g\rangle^2 \le \tau$. It follows that $\mathbb{E}\langle w, g\rangle^2 \le \tau$. Here we used that $\mathbb{E}\langle w', g\rangle\langle v, g\rangle = 0$ since $g$
can be written as the sum of two independent spherical Gaussians and $v$ and $w'$ are
orthogonal.

2. For every $v \in V$, $w \in W$, we have $\mathbb{E}\langle v, g\rangle\langle w, g\rangle = 0$. This is again because $v$ and $w$ are
orthogonal and $g$ can be written as the sum of two independent spherical Gaussians.

3. By Lemma 5.9, there exists $v \in V$ such that $\mathbb{E}\langle v, g\rangle^2 \ge \tau + \Delta$ where $\Delta \ge 1/7Br$.

4. Finally, for every $u \in \mathbb{R}^n$, we claim that $\xi^2 = \mathbb{V}\langle u, g\rangle^2 \le O(B^2\log^2 n)$ as it corresponds to
the fourth moment of a Gaussian with variance $O(B)$ conditioned on an event of probability
$\Omega(1/\mathrm{poly}(n))$. Any such event can increase the fourth moment of $O(B^2)$ by at most an
$O(\log^2 n)$ factor.

Finally to apply Lemma 6.3 for the given value of $\gamma$, $\Delta$, and $\xi^2$, we need the number of
samples to be

We have thus verified all conditions of Lemma 6.3. It follows that with probability 1 - 
exp(-nlogn), 

A 

z >z(d)>l + — . 

On the other hand, let us call a $\sigma^2 \in S$ bad if for every unit vector $u \in V_t^\perp \cap A$ we have for
$g \sim G(V_t^\perp, \sigma^2)$,



E 



(u,g) 2 /(g) = 1 



^E[<u,g> 2 ] + ! 
where $\Delta$ is the parameter from the previous lemma. We claim that every bad $\sigma^2$ will achieve
strictly smaller objective value, i.e., $z(\sigma) < 1 + \Delta/18$, with probability $1 - \exp(-n)$. This follows
from Theorem 6.2 similarly to how it was used in Lemma 6.3 except that we now use an
upper bound on $\mathbb{E}\langle u, g\rangle^2$ also for $u \in V_t^\perp \cap A$. Hence, the maximizer of the objective function
must correspond to a $\sigma^*$ that satisfies the assumptions of Lemma 6.3 as we previously verified
them for $\tilde\sigma$ up to a constant factor loss in $\Delta$. Hence, by Lemma 6.3, we must have that $v^*$
satisfies with probability $1 - \exp(-n)$,

$$\|P_{V_t^\perp \cap A}\, v^*\|^2 \ge 1 - \gamma.$$



Lemma 5.11. Suppose $(V_t, W_t)$ satisfies the Invariant and suppose the vector $v^*$ found in round $t$
satisfies Equation 18. Then, the Invariant holds for $(V_{t+1}, W_{t+1})$.

Proof. Let $\gamma = 1/(Bn)^{3.5}\log^4(Bn)$. By Equation 18, we have that $\|P_U v^*\|^2 \ge 1 - \gamma$, where
$U = V_t^\perp \cap A$. This means in particular that $\|P_A v^*\|^2 \ge 1 - \gamma$ and $\|P_{V_t} v^*\|^2 \le \gamma$. This implies that

$$\|P_A v_t\| \ge 1 - O(\gamma).$$

We claim that this implies that

$$d(V_{t+1}, W_{t+1}) \le d(V_t, W_t) + O(\gamma) \le \frac{t}{(Bn)^{3.5}\log^3(Bn)} + O\Bigl(\frac{1}{(Bn)^{3.5}\log^4(Bn)}\Bigr) \le \frac{t + 1}{(Bn)^{3.5}\log(Bn)^3},$$

for sufficiently large $n$. Since this is what we want to show, it only remains to show that the
first inequality holds. To that end, note that we can write $P_{V_{t+1}} = P_{V_t} + v_t v_t^\top$. Moreover, we have
$P_{W_{t+1}} = P_{W_t} + ww^\top$ where $w$ is some unit vector orthogonal to $W_t$ such that $\|v_t - w\| \le O(\gamma)$.
Hence,

$$d(V_{t+1}, W_{t+1}) = \|P_{V_{t+1}} - P_{W_{t+1}}\|_2 \le \|P_{V_t} - P_{W_t}\|_2 + \|v_t v_t^\top - ww^\top\|_2 \le d(V_t, W_t) + O(\gamma).$$

We have shown that with probability $1 - O(\exp(-n))$, the conditions of Lemma 5.11 are
met and in this case the Invariant holds for $(V_{t+1}, W_{t+1})$. This is what we needed to show in
order to conclude the proof of Lemma 5.7.

5.3 Finding high error certificates via direct products 

In this section we derive a useful extension of our theorem. Specifically we show how to 
find strong failure certificates. These will be distributions as before of the form $G(V^\perp, \sigma^2)$, but
now the algorithm fails with constant probability over this distribution rather than inverse- 
polynomial. In fact, our later application to compressed sensing will need precisely this 
extension of our theorem. The proof of this extension follows from a simple direct product 
construction which shows how to reduce the error probability of the oracle at intermediate 
steps exponentially while increasing the sample complexity only polynomially. 
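
To see why a direct product helps, the sketch below computes the exact probability that a strict majority of $q$ independent copies errs when each copy errs independently with probability `p_err`. This is an illustration of majority amplification in general, under the assumption of independent copies, and not a step taken verbatim from the proof; the function name `majority_error` is hypothetical.

```python
from math import comb


def majority_error(p_err, q):
    """Probability that more than q/2 of q independent copies err.

    Each copy errs independently with probability p_err; for odd q this
    is exactly the probability that a majority vote is wrong.
    """
    return sum(
        comb(q, k) * p_err ** k * (1.0 - p_err) ** (q - k)
        for k in range(q // 2 + 1, q + 1)
    )
```

For any fixed `p_err` below $1/2$ the value decays exponentially in $q$, so $q = O(\log(Bn))$ copies already push it below any inverse polynomial. Read in the contrapositive, a majority that errs with inverse polynomial probability forces the individual copies to err with constant probability.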

Definition 5.12 (Strong Failure Certificate). Let $B \ge 8$ and let $f\colon \mathbb{R}^n \to \{0,1\}$. We say that a
pair $(V, \sigma^2)$ is a $d$-dimensional strong failure certificate for $f$ if $V \subseteq \mathbb{R}^n$ is a $d$-dimensional subspace and
$\sigma^2 \in [0, 2B]$ such that for some constant $C > 0$, we have $n \ge d + 10\log(n)$ and moreover:

- Either $\sigma^2 \in [B/2, 50B]$ and $\mathbb{P}_{g \sim G(V^\perp,\sigma^2)}\{f(g) = 1\} \le 2/3$,

- or $\sigma^2 \le 2$ and $\mathbb{P}_{g \sim G(V^\perp,\sigma^2)}\{f(g) = 1\} \ge 1/3$.

The next theorem states that we can efficiently find strong failure certificates. 

Theorem 5.13. Let $B \ge 8$. Let $A \subseteq \mathbb{R}^n$ be an $r$-dimensional subspace of $\mathbb{R}^n$ such that $n \ge r + 90\log(Br)$. Assume that $B \le \mathrm{poly}(n)$. Let $f\colon \mathbb{R}^n \to \{0,1\}$ satisfy $f(x) = f(P_A x)$ for all $x \in \mathbb{R}^n$. Then, there is an algorithm that with probability $1/3$ and $\mathrm{poly}(B, r)$ adaptive oracle queries to $f$ finds a strong failure certificate for $f$. Moreover, all queries that the algorithm makes are sampled from $G(V^\perp, \sigma^2)$ for some $V \subseteq \mathbb{R}^n$ and $\sigma^2 \in (0, B]$.

Proof. Let $q = O(\log(Bn))$. Given the function $f\colon \mathbb{R}^n \to \{0,1\}$ that is invariant under the subspace $A \subseteq \mathbb{R}^n$, consider the function $f^{\otimes q}\colon \mathbb{R}^{qn} \to \{0,1\}$ defined as

$$f^{\otimes q}(x_1, x_2, \dots, x_q) = \mathrm{Majority}(f(x_1), f(x_2), \dots, f(x_q)).$$

Note that $f^{\otimes q}$ is invariant under the subspace $A^{\otimes q}$, i.e., the $q$-fold direct product of $A$ with itself. Further note that $\dim(A^{\otimes q}) = qr \le qn - 90q\log(Br)$. Hence, $f^{\otimes q}$ satisfies the assumptions of Theorem 5.3.

Now, let $G(W^\perp, \sigma^2)$ be the failure certificate returned by the algorithm guaranteed by Theorem 5.3. Recall from Definition 4.8 that a sample $g \sim G(W^\perp, \sigma^2)$ satisfies $g = P_{W^\perp} g_1 + g_2$ where $g_1 \sim N(0, \sigma^2)^{qn}$ and $g_2 \sim N(0, 1/qn)^{qn}$. Let $U_1, \dots, U_q$ be coordinate subspaces so that $U_i$ corresponds to the $i$-th block of $n$ coordinates in $\mathbb{R}^{qn}$. Clearly, we have the decomposition:

$$W^\perp = \bigoplus_{i=1}^{q} \left(U_i \cap W^\perp\right).$$

Moreover, the subspaces $U_i \cap W^\perp$ and $U_j \cap W^\perp$ are orthogonal for $i \ne j$. Since $g_1$ is a spherical Gaussian, this means that the random variables $P_{U_i \cap W^\perp} g_1$ for $i \in [q]$ are statistically independent of each other. The same is true for $P_{U_i} g_2$. Consider therefore the distribution $G_i = P_{U_i \cap W^\perp} g_1 + P_{U_i} g_2$ restricted to its $n$ nonzero coordinates, i.e., we think of $G_i$ as a random variable in $\mathbb{R}^n$. In particular, $G(W^\perp, \sigma^2) = (G_1, \dots, G_q)$ and we have established that this is a product distribution. It remains to argue that each $G_i$ has the form $G(V^\perp, \sigma^2; 1/qn)$ for some subspace $V$. This is immediate if we take $V^\perp = U_i \cap W^\perp$ thought of as a subspace of $\mathbb{R}^n$. The following fact is now an immediate consequence of the definition of the Majority function.

Claim 5.14. If $f$ is $9/10$-correct on $G_i$ for a set of at least $2q/3$ indices $i \in [q]$, then $f^{\otimes q}$ is $(1 - 1/(Bn)^5)$-correct on $G(W^\perp, \sigma^2)$.

Proof. For $x_1, \dots, x_q$ drawn from $(G_1, \dots, G_q)$, the expected number of correct answers by $f$ on these samples is at least $2q/3 \cdot 9/10 = 3q/5$. The probability that the number is below $q/2$ is therefore at most $\exp(-\Omega(q)) \le (Bn)^{-5}$ for $q = O(\log(Bn))$ using a standard Chernoff bound. ■
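
To illustrate how quickly the majority vote concentrates, the following Python sketch computes an exact binomial tail under a pessimistic model that is an assumption of this example rather than a statement from the proof: $f$ is correct independently with probability exactly $p = 9/10$ on $\lceil 2q/3 \rceil$ blocks and always wrong on the remaining blocks. The function name is hypothetical.

```python
from math import comb


def majority_failure_prob(q, p=0.9):
    """Probability that at most q/2 of the q answers are correct.

    Model (an assumption of this sketch): f is correct independently with
    probability p on m = ceil(2q/3) blocks and always wrong on the others,
    so the number of correct answers is Binomial(m, p).
    """
    m = -(-2 * q // 3)      # ceil(2q/3)
    threshold = q // 2      # the majority is wrong if correct answers <= q/2
    upper = min(threshold, m)
    return sum(comb(m, k) * p ** k * (1 - p) ** (m - k)
               for k in range(upper + 1))
```

Under this model the failure probability decays exponentially in $q$, which is why $q = O(\log(Bn))$ blocks suffice to reach error $(Bn)^{-5}$.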

Our claim implies that $f$ fails to be $9/10$-correct on at least a $1/3$ fraction of the distributions $G_i$, for $i \in [q]$. Outputting a random $G_i$ hence completes the proof of the theorem. ■



6 Top singular vector of biased Gaussian matrices 

To analyze our algorithm it was necessary to understand the top singular vector of certain 
biased Gaussian matrices. In this section we will prove the necessary lemmas. We start with 
a standard discretization of the unit sphere. 

Lemma 6.1 ($\varepsilon$-net for the sphere). For every $c > 0$, there is a set $N \subseteq S^{n-1}$ of size $|N| \le \exp(O(n\log(1/c)))$ such that for every unit vector $x \in \mathbb{R}^n$, there is a unit vector $v \in N$ satisfying $\|x - v\|_2 \le c$.

We will need the following simple variant of the Chernoff-Hoeffding bound. 

Theorem 6.2 (Chernoff-Hoeffding). Let $X_1, \dots, X_m$ be independent random variables. Let $X = \sum_{i=1}^m X_i$ and $\sigma^2 = \mathbb{V}X$. Then, for any $t > 0$,

$$\mathbb{P}\{|X - \mathbb{E}X| > t\} \le \exp(\cdots)$$

The next lemma is an application of the previous bound. We used it earlier in the proof 
of Theorem 5.3. 

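Lemma 6.3. Let $\tau > 0$. Let $V$ be a subspace of $\mathbb{R}^n$. Let $G$ be a distribution over $\mathbb{R}^n$ such that for $g \sim G$ we have:

1. For every unit vector $w \in V^\perp$, we have $\mathbb{E}\langle w, g\rangle^2 \le \tau$.

2. For every two unit vectors $v \in V$, $w \in V^\perp$, we have $\mathbb{E}\langle v, g\rangle\langle w, g\rangle = 0$.

3. The maximum of $\mathbb{E}\langle v, g\rangle^2$ over all unit vectors $v \in V$ is equal to $\tau + \Delta$ for some $\Delta \ge 1/\mathrm{poly}(n)$.

4. For every unit vector $u \in \mathbb{R}^n$, we have $\mathbb{V}\langle u, g\rangle^2 \le \xi^2$.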
Let $\gamma \ge 1/\mathrm{poly}(n)$. Draw $m = O\!\left(\frac{n\log^2(n)\,\xi^2}{\gamma^2\Delta^2}\right)$ i.i.d. samples $g_1, \dots, g_m \sim G$ and let

$$u^* = \arg\max_{\|u\| = 1} \sum_{i=1}^m \langle g_i, u\rangle^2.$$

Then, with probability $1 - \exp(-n\log^2 n)$, we have $\|P_V u^*\|^2 \ge 1 - \gamma$. Moreover,

1 m 



m t- — < 2 



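Proof. Let $g_1, \dots, g_m \sim G$ be $m$ i.i.d. samples from $G$. First consider the unit vector $v \in V$ attaining the maximum in the third assumption of the lemma. Let $X = \sum_{i=1}^m \langle v, g_i\rangle^2$. We have $\mathbb{E}X \ge \tau m + \Delta m$ and $\mathbb{V}X \le m\xi^2$. Hence, by Theorem 6.2,

$$\mathbb{P}\left\{X < (\tau + \Delta)m - \frac{\gamma m\Delta}{4}\right\} \le \exp\left(-\Omega\!\left(\frac{\Delta^2\gamma^2 m^2}{m\xi^2}\right)\right) \le \exp(-n\log^2 n).$$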

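On the other hand, let $u = \alpha v + \beta w$ with $\alpha^2 + \beta^2 = 1$ be any vector such that $v$ is a unit vector in $V$ and $w$ is a unit vector in $V^\perp$. Further assume that $\alpha^2 \le 1 - \gamma$. Let $Y = \sum_{i=1}^m \langle u, g_i\rangle^2$. Note that

$$\mathbb{E}\langle u, g\rangle^2 = \alpha^2\,\mathbb{E}\langle v, g\rangle^2 + \beta^2\,\mathbb{E}\langle w, g\rangle^2 + 2\alpha\beta\,\mathbb{E}\langle v, g\rangle\langle w, g\rangle = \alpha^2\,\mathbb{E}\langle v, g\rangle^2 + \beta^2\,\mathbb{E}\langle w, g\rangle^2.$$

Here we used the assumption that $\mathbb{E}\langle v, g\rangle\langle w, g\rangle = 0$. Hence,

$$\mathbb{E}\langle u, g\rangle^2 = \alpha^2\,\mathbb{E}\langle v, g\rangle^2 + \beta^2\,\mathbb{E}\langle w, g\rangle^2 \le (1 - \gamma)(\tau + \Delta) + \gamma\tau = \tau + (1 - \gamma)\Delta.$$

In particular, $\mathbb{E}Y \le m(\tau + (1 - \gamma)\Delta) = (\tau + \Delta)m - \gamma\Delta m$. Applying Theorem 6.2 again, we have

$$\mathbb{P}\left\{Y \ge (\tau + \Delta)m - \frac{3\gamma\Delta m}{4}\right\} \le \exp\left(-\Omega\!\left(\frac{\Delta^2\gamma^2 m^2}{m\xi^2}\right)\right) \le \exp(-n\log^2 n).$$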



Note that there is a margin of $\gamma m\Delta/2$ between the bound on $Y$ and the bound on $X$. Let $M \subseteq S^{n-1}$ be the set $M = \{u : \|P_V u\|^2 \le 1 - \gamma\}$. Further let $\widetilde{M}$ be the set obtained from $M$ by replacing each member of $M$ with its nearest point in $N$, where $N$ is the discretization of the unit sphere given by Lemma 6.1 with the setting $c = \gamma\Delta/8$. Note that $c \ge 1/\mathrm{poly}(n)$ and hence $|N| = \exp(O(n\log n))$. We claim that

$$\max_{u \in M} \sum_{i=1}^m \langle g_i, u\rangle^2 \le \max_{u \in \widetilde{M}} \sum_{i=1}^m \langle g_i, u\rangle^2 + \frac{\gamma\Delta m}{8}.$$

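This is because each squared inner product can differ by at most $\gamma\Delta/8$ and there are $m$ terms in the summation. On the other hand, we have by a union bound,

$$\mathbb{P}\left\{\max_{u \in \widetilde{M}} \sum_{i=1}^m \langle g_i, u\rangle^2 > (\tau + \Delta)m - \frac{3\gamma\Delta m}{4}\right\} \le |N| \cdot \mathbb{P}\left\{Y > (\tau + \Delta)m - \frac{3\gamma\Delta m}{4}\right\} \le \exp(O(n\log n))\exp(-\Omega(n\log^2 n)) \le \exp(-\Omega(n\log^2(n))).$$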
We conclude that with probability $1 - \exp(-n\log^2 n)$,

$$\max_{u \in M} \sum_{i=1}^m \langle g_i, u\rangle^2 \le (\tau + \Delta)m - \frac{3\gamma\Delta m}{4} + \frac{\gamma\Delta m}{8} = (\tau + \Delta)m - \frac{5\gamma\Delta m}{8},$$

and thus strictly smaller than the global maximum. This implies that the global maximizer $u^*$ must satisfy $\|P_V u^*\|^2 \ge 1 - \gamma$. ■
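
As a small illustration of Lemma 6.3, the following Python sketch draws samples from a planted Gaussian distribution and recovers the maximizer of $\sum_i \langle g_i, u\rangle^2$ by power iteration. Everything concrete here is an assumption chosen for the example and not taken from the paper: $V = \mathrm{span}(e_1)$, $n = 4$, $\tau = \Delta = 1$, $m = 4000$, and the helper names. Power iteration stands in for the exact $\arg\max$.

```python
import math
import random


def top_direction(samples, iters=200):
    """Power iteration on M = sum_i g_i g_i^T.

    Returns a unit vector approximating the maximiser of sum_i <g_i, u>^2.
    """
    n = len(samples[0])
    M = [[sum(g[a] * g[b] for g in samples) for b in range(n)]
         for a in range(n)]
    u = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        u = [sum(M[a][b] * u[b] for b in range(n)) for a in range(n)]
        norm = math.sqrt(sum(c * c for c in u))
        u = [c / norm for c in u]
    return u


def planted_samples(m, n, tau, delta, rng):
    """Independent Gaussian coordinates: variance tau + delta along e_1
    (the planted subspace V) and variance tau on all other coordinates."""
    sd_v, sd_rest = math.sqrt(tau + delta), math.sqrt(tau)
    return [[rng.gauss(0, sd_v)] + [rng.gauss(0, sd_rest) for _ in range(n - 1)]
            for _ in range(m)]


rng = random.Random(0)
samples = planted_samples(m=4000, n=4, tau=1.0, delta=1.0, rng=rng)
u_star = top_direction(samples)
mass_in_V = u_star[0] ** 2   # equals ||P_V u*||^2 for V = span(e_1)
```

In this toy setting almost all of the mass of `u_star` lies in $V$, in line with the conclusion $\|P_V u^*\|^2 \ge 1 - \gamma$.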



7 Applications and Extensions 

In this section we derive various applications of our main theorem. 



7.1 Randomized Algorithms 

While we have stated our results in terms of algorithms that output a (deterministic) function $f$ of the sketch $Ax$, we obtain the same results for algorithms which use additional randomness to output a randomized function $f$ of $Ax$. Indeed, the main observation is that our attack never makes the same query twice, with probability 1. It follows that for each possible hardwiring of the randomness of $f$ for each possible input, we obtain a deterministic function, and can apply Theorem 5.13.

Now consider the attack in Figure 2. In each round $t$, we now allow the algorithm to use a new function $f_t\colon \mathbb{R}^n \to \{0,1\}$ provided that $f_t(x)$ still only depends on $P_A x$. Under this assumption, the proof of Theorem 5.3 (and thus also the one of Theorem 5.13) carries through the same way as before except that we replace $f$ by $f_t$ in each round. What is crucial for the attack is only that the subspace $A$ has not changed.

We thus have the following theorem. 

Theorem 7.1. Let $B \ge 8$. Let $A \subseteq \mathbb{R}^n$ be an $r$-dimensional subspace of $\mathbb{R}^n$ such that $n \ge r + 90\log(Br)$. Assume that $B \le \mathrm{poly}(n)$. Let $f\colon \mathbb{R}^n \to \{0,1\}$ be a randomized algorithm for which the distribution on outputs $f(x)$ only depends on $P_A x$, for all $x \in \mathbb{R}^n$. Then, there is an algorithm that, given only oracle access to $f$, for each possible fixing of the randomness of $f$ finds with constant probability a strong failure certificate for $f$. The time and query complexity of the algorithm is bounded by $\mathrm{poly}(Br)$. Moreover, all queries that the algorithm makes are sampled from $G(V^\perp, \sigma^2)$ for some $V \subseteq \mathbb{R}^n$ and $\sigma^2 \in (0, B]$. Moreover, we may assume that for each query $x \sim G(V^\perp, \sigma^2)$ made by the algorithm in round $t$, the function $f$ is allowed to depend on $V^\perp$.

7.2 Approximating $\ell_p$ norms

Our main theorem also applies to sketches that aim to approximate any $\ell_p$-norm. A randomized sketch $Z$ for the $\ell_p$-approximation problem depends only on a subspace $A \subseteq \mathbb{R}^n$ and given $x \in \mathbb{R}^n$ aims to output a number $Z(x)$ satisfying

$$\|x\|_p \le Z(x) \le C \cdot \|x\|_p. \qquad (19)$$

The next corollary shows that we can find an input on which the sketch must be incorrect. 
Corollary 7.2. Let $1 \le p \le \infty$. Let $Z$ be a randomized sketching algorithm for approximating the $\ell_p$-norm which uses a subspace of dimension at most $n - O(\log(Cn))$. Then, there is an algorithm which, with constant probability, given only $\mathrm{poly}(Cn)$ oracle queries to $Z(x)$, finds a vector $x \in \mathbb{R}^n$ which violates Equation 19.

Proof. The result follows from Theorem 7.1, applied with approximation factor $B = O(C^2 n^2)$ to the function $f\colon \mathbb{R}^n \to \{0,1\}$ which outputs 1 if $Z(x) \ge C^2 n^2$ and 0 otherwise. The theorem gives us a strong failure certificate from which we can find a vector $x$ satisfying either $\|x\|_2 \ge 8Cn$ and $f(x) = 0$, or $\|x\|_2 \le 4\sqrt{n}$ and $f(x) = 1$. Using the fact that $\|x\|_p/\sqrt{n} \le \|x\|_2 \le \sqrt{n}\,\|x\|_p$, it follows that Equation 19 is violated. ■
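
The norm comparison used in this proof, $\|x\|_p/\sqrt{n} \le \|x\|_2 \le \sqrt{n}\,\|x\|_p$ for $1 \le p \le \infty$, can be spot-checked numerically. The Python sketch below is illustrative only; the function names and the relative tolerance `tol` are assumptions of this example.

```python
import math


def lp_norm(x, p):
    """The l_p norm of x for 1 <= p <= infinity (use math.inf for the max norm)."""
    if p == math.inf:
        return max(abs(c) for c in x)
    return sum(abs(c) ** p for c in x) ** (1.0 / p)


def norm_sandwich_holds(x, p, tol=1e-9):
    """Check ||x||_p / sqrt(n) <= ||x||_2 <= sqrt(n) * ||x||_p up to rounding."""
    n = len(x)
    l2, lp = lp_norm(x, 2), lp_norm(x, p)
    lower_ok = lp / math.sqrt(n) <= l2 * (1 + tol)
    upper_ok = l2 <= math.sqrt(n) * lp * (1 + tol)
    return lower_ok and upper_ok
```

Both extremes are attained by the all-ones vector: the lower bound is tight for $p = 1$ and the upper bound is tight for $p = \infty$.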

7.3 Sparse recovery with an $\ell_2/\ell_2$-guarantee

An $\ell_2/\ell_2$ sparse recovery algorithm for a given parameter $k$ is a randomized sketching algorithm which, given $Ax$, outputs a vector $x'$ that satisfies the approximation guarantee $\|x' - x\|_2 \le C\|x_{\mathrm{tail}(k)}\|_2$. Here $x_{\mathrm{tail}(k)}$ denotes $x$ with its top $k$ coordinates (in magnitude) replaced with 0. We will show how to find a vector $x$ which causes the output $x'$ to violate the approximation guarantee, for any $k \ge 1$.
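
The guarantee can be stated concretely in code. The following Python sketch, with hypothetical helper names, computes $x_{\mathrm{tail}(k)}$ and tests whether a candidate output $x'$ meets $\|x' - x\|_2 \le C\|x_{\mathrm{tail}(k)}\|_2$; it does not implement any recovery algorithm.

```python
import math


def tail(x, k):
    """x with its k largest-magnitude coordinates replaced by 0."""
    order = sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)
    top = set(order[:k])
    return [0.0 if i in top else x[i] for i in range(len(x))]


def l2(x):
    """Euclidean norm."""
    return math.sqrt(sum(c * c for c in x))


def satisfies_guarantee(x, x_prime, k, C):
    """True if ||x' - x||_2 <= C * ||x_tail(k)||_2."""
    err = l2([a - b for a, b in zip(x_prime, x)])
    return err <= C * l2(tail(x, k))
```

Note that when $x$ is exactly $k$-sparse the right-hand side is zero, so the guarantee forces exact recovery regardless of $C$.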

Theorem 7.3. Consider any randomized $\ell_2/\ell_2$ sparse recovery algorithm with approximation factor $C \le O(\sqrt{n})$, sparsity parameter $k \ge 1$, and sketching matrix consisting of $r = o(n/C^2)$ rows. Then there is an algorithm which, with constant probability, given only oracle access to $x'$, finds a vector $x \in \mathbb{R}^n$ which violates the approximation guarantee with $\mathrm{poly}(n)$ adaptive oracle queries.

Remark 7.4. We note that in general the $r = o(n/C^2)$ restriction cannot be improved, at least for small $k$, since there is an upper bound of $O((n/C^2) \cdot k\log(n/k))$. Indeed, with $O(k\log(n/k))$ rows, there is a deterministic procedure to compute $x'$ with $\|x' - x\|_2^2 \le 4\|x_{\mathrm{tail}(k)}\|_1^2$. By splitting the coordinates of $x$ into $n/C^2$ blocks $x^1, \dots, x^{n/C^2}$ and applying this procedure on each block, the total squared error is $4\sum_j \|x^j_{\mathrm{tail}(k)}\|_1^2 \le 4C^2\|x_{\mathrm{tail}(k)}\|_2^2$, which implies $\|x' - x\|_2 \le 2C\|x_{\mathrm{tail}(k)}\|_2$.

Proof. It suffices to prove the theorem for $k = 1$, since extending the theorem to larger $k$ can be done by appending $k - 1$ additional coordinates to the query vector, each of value $+\infty$. We will prove the theorem for each possible fixing of the randomness of the function $f$, and so we can assume the function $f$ is deterministic.

We use the sparse recovery algorithm Alg to build an algorithm Alg' for the GapNorm($B$) problem for some value of $B = \Theta(n)$. We will use Theorem 7.1 to argue that with constant probability Alg' must have failed on some query in the attack. We use that in each round $\mathrm{poly}(n)$ queries $x$ are drawn from a subspace Gaussian family $G(V^\perp, \sigma^2)$, for certain $V^\perp$ and $\sigma$ which are chosen by the attacking algorithm, and vary throughout the course of the attack. Further, we use that the function $f$ can depend on $V^\perp$.

In a given round in the simulation, we have a subspace $V^\perp$. Let $U = V^\perp \cap A$ with dimension $r' \le r$. Let

$$S = \left\{ i \in [n] \;\middle|\; e_i^\top P_{V^\perp} e_i \ge (1 - \kappa^2/C^2)^{1/2} \right\},$$

where $\kappa > 0$ is a sufficiently small constant to be determined. Notice that $\mathrm{tr}(P_{V^\perp}) \ge n - r$ and $e_i^\top P_{V^\perp} e_i \le 1$ for all $i$. The following is a simple application of Markov's bound.

Lemma 7.5. Let $\tau = 1 - \alpha/C^2$ for a constant $\alpha > 0$. The number $z$ of indices $i$ for which $e_i^\top P_{V^\perp} e_i$ is larger than $\tau$ is at least $n - C^2 r/\alpha$.

Proof.

$$n - r \le \sum_{i=1}^n e_i^\top P_{V^\perp} e_i \le z \cdot 1 + (n - z) \cdot \tau = (1 - \tau)z + n\tau,$$

and so $(1 - \tau)(n - z) \le r$, or $z \ge n - \frac{r}{1 - \tau} = n - \frac{C^2 r}{\alpha}$.
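
The counting argument of Lemma 7.5 uses only two facts about the diagonal entries $e_i^\top P_{V^\perp} e_i$: each lies in $[0, 1]$ and they sum to at least $n - r$. The Python sketch below checks the bound on synthetic diagonals with exactly these two properties. The generator is a stand-in assumed for this example (it does not build an actual projection matrix), and all names are hypothetical.

```python
import random


def random_diagonal(n, r, rng):
    """Entries in [0, 1] summing to n - r, mimicking the diagonal of a
    projection of trace n - r (assumes r is much smaller than n)."""
    deficit = [0.0] * n
    remaining = float(r)
    while remaining > 1e-12:
        i = rng.randrange(n)
        step = min(remaining, rng.random() * (1.0 - deficit[i]))
        deficit[i] += step
        remaining -= step
    return [1.0 - d for d in deficit]


def count_large(diag, tau):
    """Number z of indices whose diagonal entry is larger than tau."""
    return sum(1 for d in diag if d > tau)


def lemma_lower_bound(n, r, C, alpha):
    """The bound n - C^2 r / alpha on z, for tau = 1 - alpha / C^2."""
    return n - C * C * r / alpha
```

For instance, with $n = 1000$, $r = 10$, $C = 2$ and $\alpha = 1/2$ the threshold is $\tau = 0.875$ and the lemma guarantees at least 920 large entries.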

By Lemma 7.5, for appropriate $r = \beta n/C^2$, where $\beta > 0$ is a sufficiently small positive constant depending on $\kappa$, we have that $|S| \ge 2n/3$, which holds at any round in the attack since $\mathrm{tr}(P_{V^\perp})$ is always at least $n - r$.

The attack: We design a function $f(Ax)$, which is allowed to, and will, depend on $V^\perp$ (as well as $A$), to solve Gap-Norm($B$). Our reduction is deterministic. Here $x \sim G(V^\perp, \sigma^2)$ is a query generated during the attack given by Theorem 7.1.

Given $Ax$, the algorithm Alg' first computes the set $S$ with $|S| \ge 2n/3$ with $e_i^\top P_{V^\perp} e_i \ge (1 - \kappa^2/C^2)^{1/2}$ for all $i \in S$. Notice that $S$ depends only on $V^\perp$. For $i \in S$, let $y^i = x + 4C\sqrt{n}\,P_{V^\perp} e_i$. Given $V^\perp$ and $Ax$ (and $A$), Alg' can compute $Ay^i = Ax + 4C\sqrt{n}\,AP_{V^\perp} e_i$.

Let $z^i$ be the output of Alg run on input $Ay^i$, for each $i \in S$. If $|z^i_i| \ge C\sqrt{n}$ for all $i \in S$, then Alg' sets $f(x) = 0$, otherwise it sets $f(x) = 1$.
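
The two mechanical steps of the reduction can be sketched in Python as follows. This is a minimal illustration under stated assumptions and not the authors' implementation: matrices are plain nested lists, the outputs $z^i$ of Alg are passed in as data rather than computed, and the names `shifted_sketch` and `gap_norm_decision` are hypothetical.

```python
import math


def shifted_sketch(Ax, A, P_perp, i, C):
    """Return A y^i = A x + 4 C sqrt(n) A P_perp e_i using only A x, A and P_perp.

    A is an r x n matrix and P_perp an n x n matrix, both as lists of rows.
    """
    n = len(P_perp)
    scale = 4.0 * C * math.sqrt(n)
    column = [P_perp[row][i] for row in range(n)]            # P_perp e_i
    shift = [sum(a_row[j] * column[j] for j in range(n))     # A P_perp e_i
             for a_row in A]
    return [s + scale * t for s, t in zip(Ax, shift)]


def gap_norm_decision(outputs, S, C, n):
    """Decision rule of Alg': outputs[i] is the vector z^i returned by Alg
    on input A y^i. Returns 0 if |z^i_i| >= C sqrt(n) for all i in S, else 1."""
    threshold = C * math.sqrt(n)
    return 0 if all(abs(outputs[i][i]) >= threshold for i in S) else 1
```

The point of `shifted_sketch` is that the shifted query never needs $x$ itself, only the sketch $Ax$, which is what makes the reduction a valid function of $Ax$.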

We assume for all queries $x$ and all $\sigma$ chosen during the attack, that $\|x\|_2^2 \in [n\sigma \ldots]$, which happens with probability $1 - 1/n^{\omega(1)}$ by standard concentration bounds for the $\chi^2$-distribution. To analyze this reduction, we distinguish two cases for which we require correctness for $f$.

Case 1: $\|x\|_2^2 \le 4n$. By concentration bounds of the $\chi^2$-distribution, we can assume $\sigma^2 \le 8$. Let $i \in S$ and consider $y^i$. Then,

$$\begin{aligned}
\|y^i_{\mathrm{tail}(1)}\|_2 &\le \|y^i - 4C\sqrt{n}\,(e_i^\top P_{V^\perp} e_i)\,e_i\|_2 \\
&\le \|x\|_2 + 4C\sqrt{n}\,(1 - (e_i^\top P_{V^\perp} e_i)^2)^{1/2} \\
&\le 2\sqrt{n} + 4C\sqrt{n}\,\bigl(1 - (\sqrt{1 - \kappa^2/C^2})^2\bigr)^{1/2} \\
&\le 2\sqrt{n} + 4\kappa\sqrt{n} \\
&= (2 + 4\kappa)\sqrt{n}.
\end{aligned}$$

By correctness of Alg, it follows that for the output $z^i$ of Alg, we have

$$\|z^i - y^i\|_2 \le C(2 + 4\kappa)\sqrt{n}. \qquad (20)$$

This implies that $z^i_i$ must be at least $C\sqrt{n}$. Indeed, otherwise we would have

$$\|z^i - y^i\|_2 \ge |z^i_i - y^i_i| \ge 4C\sqrt{n}\sqrt{1 - \kappa^2/C^2} - |x_i| - C\sqrt{n} \ge 2.5\,C\sqrt{n},$$

contradicting (20) for $\kappa$ a sufficiently small constant (and $C$ at least a sufficiently large constant, which we can assume since it only weakens the correctness requirement of Alg).

Note that we can assume Alg is correct on a query $x \sim G(V^\perp, \sigma^2)$ (as otherwise we are already done). Hence, $z^i_i$ must be at least $C\sqrt{n}$ simultaneously for every $i \in S$, with probability at least $1 - 1/n^s$ for arbitrarily large $s$. We thus have:

$$\mathbb{P}\{f(x) = 0 \mid \|x\|_2^2 \le 4n\} \ge 1 - \frac{1}{n^s}. \qquad (21)$$
Case 2: $Bn/4 \le \|x\|_2^2 \le 100Bn$, for a parameter $B = \Theta(n)$. Recall that $x$ is obtained by sampling $g_1 \sim N(0, \sigma^2)^n$, $g_2 \sim N(0, 1/4)^n$, and setting $x = P_{V^\perp} g_1 + g_2$. Recall that $U = V^\perp \cap A$, so that we have $P_A x = P_U g_1 + P_A g_2$. We can assume that $\sigma^2 \ge B/8$, as mentioned above, by tail bounds on the $\chi^2$-distribution.

Associate $U$ with an $r' \times n$ orthonormal matrix ($r' \le r$) whose rows span $U$. Since the rows of $U$ are orthonormal, $\|U\|_F^2 \le r'$, and so by averaging, for at least $2n/3$ of its columns $U_j$, we have $\|U_j\|_2^2 \le 3r'/n \le 3r/n$. Let $T = \{j \mid \|U_j\|_2^2 \le 3r/n\}$.

Fix a $j \in T$, and consider $y^j = x + 4C\sqrt{n}\,P_{V^\perp} e_j$. For $x \sim G(V^\perp, \sigma^2)$, we start by upper bounding the variation distance between the distributions of random variables $P_A x = P_U g_1 + P_A g_2$ and $P_A y^j = P_A x + 4C\sqrt{n}\,P_A P_{V^\perp} e_j = P_U g_1 + 4C\sqrt{n}\,P_U e_j + P_A g_2$. The variation distance cannot decrease by fixing $g_2$, so it suffices to upper bound the variation distance between $P_U g_1$ and $P_U g_1 + 4C\sqrt{n}\,P_U e_j$. Since $P_U$ is a projection matrix, there is a 1-to-1 map from these distributions to $U g_1$ and $U g_1 + 4C\sqrt{n}\,U e_j$, so we upper bound the variation distance between the latter two distributions.

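By rotational invariance of the Gaussian distribution, $U g_1 \sim N(0, \sigma^2\mathrm{Id}_{r'})$, while $U g_1 + 4C\sqrt{n}\,U e_j \sim N(4C\sqrt{n}\,U_j, \sigma^2\mathrm{Id}_{r'})$, where $U_j$ is the $j$-th column of $U$. Applying Fact 4.13, and using $\sigma^2 \ge B/8$ together with $r = \beta n/C^2$,

$$\|N(0, \sigma^2\mathrm{Id}_{r'}) - N(4C\sqrt{n}\,U_j, \sigma^2\mathrm{Id}_{r'})\|_{tv} \le \frac{4C\sqrt{n}\,\|U_j\|_2}{\sigma} \le \frac{\sqrt{8} \cdot 4C\sqrt{n} \cdot \sqrt{3r/n}}{\sqrt{B}} = \frac{8\sqrt{6}\sqrt{\beta n}}{\sqrt{B}}.$$

By the triangle inequality, for any $j, j' \in T$,

$$\|N(4C\sqrt{n}\,U_j, \sigma^2\mathrm{Id}_{r'}) - N(4C\sqrt{n}\,U_{j'}, \sigma^2\mathrm{Id}_{r'})\|_{tv} \le \frac{16\sqrt{6}\sqrt{\beta n}}{\sqrt{B}},$$

and so the variation distance between the distributions of random variables $P_A(x + 4C\sqrt{n}\,P_{V^\perp} e_j)$ and $P_A(x + 4C\sqrt{n}\,P_{V^\perp} e_{j'})$ is at most $\frac{16\sqrt{6}\sqrt{\beta n}}{\sqrt{B}}$. For any constant $\gamma > 0$, we can choose the constant $\beta$ sufficiently small so that for $B = \gamma^2 n$, this variation distance is at most $1/100$. We fix such a $\beta$, for a $\gamma$ to be determined below.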
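
The choice of $\beta$ can be made explicit. The Python sketch below is illustrative and rests on two assumptions that are not taken from the text: that Fact 4.13 is the standard bound $\|\mu\|_2/\sigma$ on the variation distance between $N(0, \sigma^2\mathrm{Id})$ and $N(\mu, \sigma^2\mathrm{Id})$, and that the exact distance between two such Gaussians is $\mathrm{erf}(\|\mu\|_2/(2\sqrt{2}\sigma))$, which follows by reducing to one dimension along $\mu$. The function names are hypothetical.

```python
import math


def tv_shifted_gaussians(shift_norm, sigma):
    """Exact variation distance between N(0, sigma^2 Id) and N(mu, sigma^2 Id)
    where ||mu||_2 = shift_norm (one-dimensional reduction along mu)."""
    return math.erf(shift_norm / (2.0 * math.sqrt(2.0) * sigma))


def pairwise_bound(beta, n, B):
    """The bound 16 sqrt(6) sqrt(beta n) / sqrt(B) from the text."""
    return 16.0 * math.sqrt(6.0) * math.sqrt(beta * n) / math.sqrt(B)


def beta_for(gamma, target=0.01):
    """Value of beta making the bound equal `target` when B = gamma^2 n."""
    return (target * gamma / (16.0 * math.sqrt(6.0))) ** 2
```

With $B = \gamma^2 n$ the bound simplifies to $16\sqrt{6}\sqrt{\beta}/\gamma$, so it is independent of $n$ and any $\beta \le (\gamma/(1600\sqrt{6}))^2$ brings it below $1/100$.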

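Fix an $i \in S \cap T$. Consider the output $z^i$ of Alg given $Ay^i$. Using that $\|x\|_2 \le 10\sqrt{Bn} = 10\gamma n$, we have

$$\begin{aligned}
\|y^i_{\mathrm{tail}(1)}\|_2 &\le \|y^i - 4C\sqrt{n}\,(e_i^\top P_{V^\perp} e_i)\,e_i\|_2 \\
&\le \|x\|_2 + 4C\sqrt{n}\,(1 - (e_i^\top P_{V^\perp} e_i)^2)^{1/2} \\
&\le 10\gamma n + 4C\sqrt{n}\,\bigl(1 - (\sqrt{1 - \kappa^2/C^2})^2\bigr)^{1/2} \\
&\le 10\gamma n + 4\kappa\sqrt{n}.
\end{aligned}$$

If Alg succeeds,

$$\begin{aligned}
\|z^i\|_2 &\le \|y^i\|_2 + C\|y^i_{\mathrm{tail}(1)}\|_2 \\
&\le \|x\|_2 + 4C\sqrt{n} + C(10\gamma n + 4\kappa\sqrt{n}) \\
&\le 10\gamma n + 4C\sqrt{n} + 10C\gamma n + 4\kappa C\sqrt{n} \\
&\le \zeta C n,
\end{aligned}$$

where $\zeta > 0$ is a constant that can be made arbitrarily small by making $\gamma > 0$ arbitrarily small (and assuming $n$ large enough). Hence, $\|z^i\|_2^2 \le \zeta^2 C^2 n^2$. It follows that $|z^i_j| \ge C\sqrt{n}$ for at most $\zeta^2 n$ values of $j$. Now we use the following facts if Alg succeeds.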

1. $|S| \ge 2n/3$,

2. $|T| \ge 2n/3$,

3. for all $i$, $z^i$ contains at most $\zeta^2 n$ values of $j$ for which $|z^i_j| \ge C\sqrt{n}$,

4. for any $j, j' \in T$, the variation distance between the distributions of random variables $P_A(x + 4C\sqrt{n}\,P_{V^\perp} e_j)$ and $P_A(x + 4C\sqrt{n}\,P_{V^\perp} e_{j'})$ is at most $1/100$.

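The first and second conditions imply that $|S \cap T| \ge 2n/3 - n/3 = n/3$. The third and fourth conditions imply that for any $i \in S \cap T$, if Alg succeeds (which we can assume), then

$$\mathbb{P}_{x \sim G(V^\perp, \sigma^2)}\bigl[f(x) = 0\bigr] \le \mathbb{E}_{j \in S \cap T}\, \mathbb{P}_{x \sim G(V^\perp, \sigma^2)}\bigl[|z^i_j| \ge C\sqrt{n}\bigr] + \frac{1}{100} \le \frac{\zeta^2 n}{n/3} + \frac{1}{100} \le \frac{1}{10},$$

where the last inequality follows for sufficiently small constant $\zeta > 0$. Hence,

$$\mathbb{P}\bigl(f(x) = 0 \mid Bn/4 \le \|x\|_2^2 \le 100Bn\bigr) \le \frac{1}{10}. \qquad (22)$$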

Combining (21) and (22), for sufficiently large $n$ we have for all $V^\perp$ and $\sigma$ chosen throughout the course of the attack,

$$\mathbb{P}_{x \sim G(V^\perp, \sigma^2)}\bigl\{(f(x) = 1 \wedge \|x\|_2^2 \le 4n) \vee (f(x) = 0 \wedge Bn/4 \le \|x\|_2^2 \le 100Bn)\bigr\} \le \frac{1}{10}.$$



Wrap-up: We have built a function $f$ for Gap-Norm($B$) which has distributional error less than $1/10$ on $G(V^\perp, \sigma^2)$ for any $V^\perp$ and $\sigma$ chosen throughout the course of the attack, whenever $\|x\|_2^2 \le 4n$ or $Bn/4 \le \|x\|_2^2 \le 100Bn$. The reduction is deterministic and holds for each setting of the randomness of $f$. It follows by Theorem 7.1 that with constant probability, with $\mathrm{poly}(n)$ adaptive oracle queries to $f$, we will find a strong failure certificate $(V, \sigma^2)$ for $f$ (for each fixing of its randomness). In this case, either $\sigma^2 \ge B/2$ and $\mathbb{P}_{g \sim G(V^\perp, \sigma^2)}\{f(g) = 1\} \le 2/3$, which would violate (22), or $\sigma^2 \le 2$ and $\mathbb{P}_{g \sim G(V^\perp, \sigma^2)}\{f(g) \ne 0\} \ge 1/3$, which would violate (21). It follows that our assumption that the compressed sensing algorithm Alg succeeded on all queries made was false, which implies that we have found a query to Alg violating its approximation guarantee. This completes the proof. ■